Detailed system architecture and component documentation for Discogsography
Discogsography is built as a microservices platform that processes large-scale music data from Discogs and transforms it into queryable knowledge graphs and relational databases. The architecture emphasizes scalability, reliability, and performance.
| Service | Purpose | Key Technologies | Port(s) |
|---|---|---|---|
| API | User auth, graph queries, and sync triggers | FastAPI, psycopg3, redis, Discogs OAuth 1.0 | 8004 (ext), 8005 |
| Extractor | High-performance Rust-based extractor (runs as extractor-discogs and extractor-musicbrainz Docker services) | tokio, quick-xml, lapin | 8000 (health) |
| Schema-Init | One-shot DB schema initializer | neo4j-driver, psycopg3 | none |
| Graphinator | Builds Neo4j knowledge graphs | neo4j-driver, graph algorithms | 8001 (health) |
| Tableinator | Creates PostgreSQL analytics tables | psycopg3, JSONB, full-text search | 8002 (health) |
| Explore | Static frontend files and health check | FastAPI, Tailwind CSS, Alpine.js, D3.js, Plotly.js | 8006, 8007 (internal) |
| Dashboard | Real-time monitoring and admin panel | FastAPI, WebSocket, reactive UI, httpx | 8003 (ext) |
| Insights | Precomputed analytics and music trends | FastAPI, psycopg3, httpx | 8008, 8009 (internal) |
| MCP Server | Exposes knowledge graph to AI assistants | FastMCP, httpx | stdio / streamable-http |
| Service | Purpose | Key Technologies | Port(s) |
|---|---|---|---|
| Brainzgraphinator | Enriches Neo4j graph with MusicBrainz metadata and relationships | neo4j-driver, pika | 8011 (health) |
| Brainztableinator | Stores all MusicBrainz data in PostgreSQL | psycopg3, pika | 8010 (health) |
| Component | Purpose | Port(s) |
|---|---|---|
| RabbitMQ | Message broker and queue management | 5672, 15672 |
| Neo4j | Graph database for relationships | 7474, 7687 |
| PostgreSQL | Relational database for analytics | 5433 (mapped) |
| Redis | Cache layer for queries, sessions, and analytics | 6379 |
Shows the ingestion flow from Discogs and MusicBrainz data dumps through extraction, message distribution, and persistence into both databases.
graph TD
S3[("Discogs S3<br/>Monthly Data Dumps<br/>~11.3GB XML")]
MB[("MusicBrainz<br/>JSONL Dumps<br/>Twice Weekly")]
SCHEMA[["Schema-Init<br/>One-Shot DB<br/>Schema Initialiser"]]
EXT_D[["Extractor<br/>--source discogs<br/>XML Processing"]]
EXT_MB[["Extractor<br/>--source musicbrainz<br/>JSONL Processing"]]
RMQ{{"RabbitMQ 4.x<br/>Message Broker<br/>8 Fanout Exchanges"}}
NEO4J[("Neo4j 2026<br/>Graph Database<br/>Relationships")]
PG[("PostgreSQL 18<br/>Analytics DB<br/>Full-text Search")]
GRAPH[["Graphinator<br/>Graph Builder"]]
TABLE[["Tableinator<br/>Table Builder"]]
BGRAPH[["Brainzgraphinator<br/>Neo4j Enrichment"]]
BTABLE[["Brainztableinator<br/>PostgreSQL Storage"]]
SCHEMA -->|0. Create schemas| NEO4J
SCHEMA -->|0. Create schemas| PG
S3 -->|1a. Download & Parse XML| EXT_D
MB -->|1b. Parse JSONL| EXT_MB
EXT_D -->|2a. 4 Discogs exchanges| RMQ
EXT_MB -->|2b. 4 MB exchanges| RMQ
RMQ -->|3a. Artists/Labels/Releases/Masters| GRAPH
RMQ -->|3b. Artists/Labels/Releases/Masters| TABLE
RMQ -->|3c. MB Artists/Labels/Release-Groups/Releases| BGRAPH
RMQ -->|3d. MB Artists/Labels/Release-Groups/Releases| BTABLE
GRAPH -->|4a. Build Graph| NEO4J
TABLE -->|4b. Store Data| PG
BGRAPH -->|4c. Enrich Nodes| NEO4J
BTABLE -->|4d. Store MB Data| PG
style S3 fill:#e1f5fe,stroke:#01579b,stroke-width:2px
style MB fill:#e1f5fe,stroke:#01579b,stroke-width:2px
style SCHEMA fill:#f9fbe7,stroke:#827717,stroke-width:2px
style EXT_D fill:#ffccbc,stroke:#d84315,stroke-width:2px
style EXT_MB fill:#ffccbc,stroke:#d84315,stroke-width:2px
style RMQ fill:#fff3e0,stroke:#e65100,stroke-width:2px
style NEO4J fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
style PG fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
style GRAPH fill:#e0f2f1,stroke:#004d40,stroke-width:2px
style TABLE fill:#fce4ec,stroke:#880e4f,stroke-width:2px
style BGRAPH fill:#e0f2f1,stroke:#004d40,stroke-width:2px
style BTABLE fill:#fce4ec,stroke:#880e4f,stroke-width:2px
Shows how user-facing services interact with each other and with the storage layer at runtime.
graph TD
NEO4J[("Neo4j 2026<br/>Graph Database")]
PG[("PostgreSQL 18<br/>Analytics DB")]
REDIS[("Redis<br/>Cache Layer")]
EXPLORE[["Explore<br/>Graph Explorer<br/>Trends & Paths"]]
API[["API<br/>User Auth<br/>JWT & OAuth"]]
INSIGHTS[["Insights<br/>Precomputed Analytics<br/>Music Trends"]]
DASH[["Dashboard<br/>Real-time Monitor<br/>WebSocket"]]
EXPLORE -.->|Proxy /api/*| API
API -.->|User Accounts| PG
API -.->|Graph Queries| NEO4J
API -.->|OAuth State + Snapshots| REDIS
API -.->|Proxy /api/insights/*| INSIGHTS
INSIGHTS -.->|Fetch /api/internal/*| API
INSIGHTS -.->|Store Results| PG
INSIGHTS -.->|Cache Results| REDIS
DASH -.->|Cache| REDIS
DASH -.->|Stats| NEO4J
DASH -.->|Stats| PG
style NEO4J fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
style PG fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
style REDIS fill:#ffebee,stroke:#b71c1c,stroke-width:2px
style EXPLORE fill:#e8eaf6,stroke:#283593,stroke-width:2px
style API fill:#e3f2fd,stroke:#0d47a1,stroke-width:2px
style INSIGHTS fill:#fff9c4,stroke:#f57f17,stroke-width:2px
style DASH fill:#fce4ec,stroke:#880e4f,stroke-width:2px
Shows the Dashboard service's monitoring connections to pipeline services and infrastructure. The Dashboard monitors pipeline services grouped by source (Discogs and MusicBrainz). It does not monitor API, Explore, or Insights health.
graph TD
DASH[["Dashboard<br/>Real-time Monitor<br/>WebSocket"]]
subgraph Discogs ["Discogs Pipeline"]
EXT_D[["Extractor Discogs"]]
GRAPH[["Graphinator"]]
TABLE[["Tableinator"]]
end
subgraph MB ["MusicBrainz Pipeline"]
EXT_MB[["Extractor MB"]]
BGRAPH[["Brainzgraphinator"]]
BTABLE[["Brainztableinator"]]
end
RMQ{{"RabbitMQ"}}
NEO4J[("Neo4j")]
PG[("PostgreSQL")]
REDIS[("Redis")]
DASH -.->|Monitor| EXT_D
DASH -.->|Monitor| GRAPH
DASH -.->|Monitor| TABLE
DASH -.->|Monitor| EXT_MB
DASH -.->|Monitor| BGRAPH
DASH -.->|Monitor| BTABLE
DASH -.->|Stats| RMQ
DASH -.->|Stats| NEO4J
DASH -.->|Stats| PG
DASH -.->|Cache| REDIS
style DASH fill:#fce4ec,stroke:#880e4f,stroke-width:2px
style EXT_D fill:#ffccbc,stroke:#d84315,stroke-width:2px
style EXT_MB fill:#ffccbc,stroke:#d84315,stroke-width:2px
style GRAPH fill:#e0f2f1,stroke:#004d40,stroke-width:2px
style TABLE fill:#fce4ec,stroke:#880e4f,stroke-width:2px
style BGRAPH fill:#e0f2f1,stroke:#004d40,stroke-width:2px
style BTABLE fill:#fce4ec,stroke:#880e4f,stroke-width:2px
style RMQ fill:#fff3e0,stroke:#e65100,stroke-width:2px
style NEO4J fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
style PG fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
style REDIS fill:#ffebee,stroke:#b71c1c,stroke-width:2px
Shows how the MCP server connects AI assistants to the knowledge graph through the API service.
graph LR
AI["AI Assistant<br/>(Claude, Cursor, Zed)"]
MCP[["MCP Server<br/>11 tools<br/>stdio / HTTP"]]
API[["API<br/>FastAPI"]]
NEO4J[("Neo4j")]
PG[("PostgreSQL")]
REDIS[("Redis")]
AI <-->|MCP Protocol| MCP
MCP -->|httpx| API
API --- NEO4J & PG & REDIS
style AI fill:#e8eaf6,stroke:#283593,stroke-width:2px
style MCP fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px
style API fill:#e3f2fd,stroke:#0d47a1,stroke-width:2px
style NEO4J fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
style PG fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
style REDIS fill:#ffebee,stroke:#b71c1c,stroke-width:2px
Extractor (Rust-based, two modes):
- Discogs mode (`--source discogs`): Downloads XML dumps from the Discogs S3 bucket, parses XML at high throughput (20,000-400,000+ records/sec), and deduplicates records by SHA256 hash
- MusicBrainz mode (`--source musicbrainz`): Parses MusicBrainz JSONL dumps (xz-compressed), extracts Discogs IDs from URL relationships, and publishes to MusicBrainz-specific fanout exchanges
- Both modes publish JSON messages to per-data-type RabbitMQ fanout exchanges
- Each mode runs as a separate container with its own state markers
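As an illustration of the "extracts Discogs IDs from URL relationships" step, here is a minimal Python sketch. The real extractor is written in Rust; the `relations` / `url.resource` record shape and the helper name `extract_discogs_id` are assumptions made for this example, not taken from the extractor source.

```python
import re
from typing import List, Optional

# Matches Discogs entity URLs such as https://www.discogs.com/artist/45
# or https://www.discogs.com/artist/45-Aphex-Twin (the slug is optional).
_DISCOGS_URL = re.compile(
    r"^https?://(?:www\.)?discogs\.com/(artist|label|master|release)/(\d+)"
)


def extract_discogs_id(relations: List[dict], entity: str) -> Optional[int]:
    """Return the first Discogs ID of the requested entity kind, or None.

    `relations` is assumed to be a list of URL relationship objects,
    each carrying the target under rel["url"]["resource"].
    """
    for rel in relations:
        url = rel.get("url", {}).get("resource", "")
        match = _DISCOGS_URL.match(url)
        if match and match.group(1) == entity:
            return int(match.group(2))
    return None
```

Records for which this returns `None` have no Discogs cross-reference. Per the consumer descriptions in this document, Brainzgraphinator skips such entities while Brainztableinator still stores them.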
RabbitMQ Fanout Exchanges (8 total, one per data type per source, decoupled from consumers):
Discogs exchanges (4):
- `discogsography-discogs-artists`: Artist and band data
- `discogsography-discogs-labels`: Record label information
- `discogsography-discogs-releases`: Release records
- `discogsography-discogs-masters`: Master recording data
MusicBrainz exchanges (4):
- `discogsography-musicbrainz-artists`: MusicBrainz artist data with Discogs cross-references
- `discogsography-musicbrainz-labels`: MusicBrainz label data with Discogs cross-references
- `discogsography-musicbrainz-release-groups`: MusicBrainz release-group data with Discogs master cross-references
- `discogsography-musicbrainz-releases`: MusicBrainz release data with Discogs cross-references
Each consumer independently declares and binds its own queues to these exchanges.
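The naming scheme behind the eight exchanges and the per-consumer queues can be sketched as below. The exchange and queue names follow the patterns shown in this document (`discogsography-<source>-<type>`, `<consumer>-<type>`); the helper functions themselves are hypothetical and only illustrate which queue binds to which exchange.

```python
from typing import Dict, Iterable

DISCOGS_TYPES = ("artists", "labels", "releases", "masters")
MUSICBRAINZ_TYPES = ("artists", "labels", "release-groups", "releases")


def exchange_name(source: str, data_type: str) -> str:
    """Fanout exchange for one data type of one source."""
    return f"discogsography-{source}-{data_type}"


def queue_name(consumer: str, data_type: str) -> str:
    """Queue that a consumer declares for one data type."""
    return f"{consumer}-{data_type}"


def bindings(consumer: str, source: str, data_types: Iterable[str]) -> Dict[str, str]:
    """Map each queue a consumer declares to the exchange it binds to."""
    return {queue_name(consumer, t): exchange_name(source, t) for t in data_types}
```

Because the exchanges are fanout, adding another consumer only means declaring four more queues and binding them; the producer side does not change.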
Message Types:
- `data`: Individual records with SHA256 hash
- `file_complete`: Sent per data type when a file finishes processing
- `extraction_complete`: Sent once to all 4 exchanges after all files finish, carrying a `started_at` timestamp and per-type record counts
{
"type": "data",
"id": "<record_id>",
"sha256": "<64-char hex hash>",
...entity-specific fields
}
See the Extractor Message Format section of Database Schema for detailed examples.
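A minimal sketch of how a `data` envelope like the one above could be assembled. It assumes the hash is computed over a canonical (sorted-key) JSON encoding of the record; this document does not specify the exact bytes the extractor hashes, and `build_data_message` is a hypothetical helper.

```python
import hashlib
import json


def build_data_message(record: dict) -> dict:
    """Wrap an entity record in a `data` envelope with a content hash."""
    # Assumption: hash a canonical JSON form so key order does not matter.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    message = {"type": "data", "id": str(record["id"]), "sha256": digest}
    # Entity-specific fields follow the three envelope fields.
    message.update({k: v for k, v in record.items() if k != "id"})
    return message
```

A stable hash is what lets consumers detect unchanged records cheaply: two runs that produce the same record produce the same `sha256`.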
Graphinator (Neo4j):
- Consumes messages from all 4 queues
- Creates nodes: Artist, Label, Release, Master, Genre, Style
- Builds relationships: BY, ON, MEMBER_OF, DERIVED_FROM, etc.
- On `extraction_complete`: deletes stub nodes (no `sha256` property) created by cross-type MERGE operations
- Post-import computation (after releases complete): Runs `compute_genre_style_stats()` to pre-compute aggregate properties on Genre and Style nodes (release_count, artist_count, label_count, style_count/genre_count, first_year). Uses `CALL {} IN TRANSACTIONS OF 1 ROWS` to process each node in its own transaction, which avoids the 120s timeout for mega-genres like Rock/Electronic. These properties replace expensive runtime traversals (~200M DB hits → 6 DB hits per query).
Tableinator (PostgreSQL):
- Consumes messages from all 4 queues
- Stores JSONB documents in relational tables; always refreshes `updated_at`, only rewrites data when the hash differs
- Creates indexes for fast queries
- On `extraction_complete`: purges stale rows where `updated_at < started_at`
See the Post-Extraction Cleanup section of Database Schema for details.
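The Tableinator's refresh-then-purge behaviour can be sketched with an in-memory SQLite database standing in for PostgreSQL. The table layout, column names, and helper functions are assumptions for illustration; only the behaviour (always refresh `updated_at`, rewrite data only on hash change, purge rows older than `started_at`) comes from the description above.

```python
import sqlite3

UPSERT = """
INSERT INTO artists (id, sha256, data, updated_at)
VALUES (:id, :sha256, :data, :now)
ON CONFLICT(id) DO UPDATE SET
    updated_at = excluded.updated_at,
    data = CASE WHEN artists.sha256 <> excluded.sha256
                THEN excluded.data ELSE artists.data END,
    sha256 = excluded.sha256
"""


def make_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE artists ("
        "id TEXT PRIMARY KEY, sha256 TEXT, data TEXT, updated_at REAL)"
    )
    return db


def upsert(db, record_id, sha256, data, now):
    """Insert or refresh a row; data is rewritten only when the hash differs."""
    db.execute(UPSERT, {"id": record_id, "sha256": sha256, "data": data, "now": now})


def purge_stale(db, started_at) -> int:
    """Delete rows not touched since the extraction started; return the count."""
    cur = db.execute("DELETE FROM artists WHERE updated_at < ?", (started_at,))
    return cur.rowcount
```

Refreshing `updated_at` even for unchanged rows is what makes the purge safe: any row still carrying an old timestamp was absent from the latest dump.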
Brainzgraphinator (Neo4j enrichment):
- Consumes messages from 4 MusicBrainz queues (artists, labels, release-groups, releases)
- Enriches existing Discogs nodes with `mb_`-prefixed properties (type, gender, dates, area, disambiguation)
- Creates 8 new relationship edge types between Discogs-matched entities (COLLABORATED_WITH, TAUGHT, TRIBUTE_TO, FOUNDED, SUPPORTED, SUBGROUP_OF, RENAMED_TO, enriched MEMBER_OF)
- All MB-sourced edges carry `source: 'musicbrainz'` for provenance tracking
- Skips entities without a Discogs match; only enriches nodes already in the graph
- Idempotent: `MATCH...SET` for metadata, `MERGE` for edges, so re-import is safe
Brainztableinator (PostgreSQL):
- Consumes messages from 4 MusicBrainz queues (artists, labels, release-groups, releases)
- Stores all MusicBrainz entities in the `musicbrainz` PostgreSQL schema, including entities without Discogs matches
- Records artist-to-artist relationships (collaborations, band membership, etc.)
- Stores external links (Wikipedia, Wikidata, AllMusic, Last.fm, IMDb)
- Idempotent via `ON CONFLICT DO UPDATE/NOTHING`
See MusicBrainz Sync Guide for operational instructions.
API Service (graph query endpoints):
- Interactive graph exploration (`/api/explore`, `/api/expand`)
- Trend analysis and pattern discovery (`/api/trends`)
- Entity autocomplete and node detail lookup (`/api/autocomplete`, `/api/node/{id}`)
- User collection and wantlist queries (`/api/user/collection`, `/api/user/wantlist`)
- Collection gap analysis (`/api/collection/gaps/label/{id}`, `/api/collection/gaps/artist/{id}`, `/api/collection/gaps/master/{id}`)
- Graph snapshot save/restore (`/api/snapshot`)
API Service (new feature endpoints):
- Path finder (`/api/path`)
- Unified full-text search (`/api/search`)
- Label DNA fingerprinting and comparison (`/api/label/{label_id}/dna`, `/api/label/{label_id}/similar`, `/api/label/dna/compare`)
- Taste fingerprint analytics (`/api/user/taste/*`)
- Vinyl Archaeology time-travel filtering (`/api/explore/year-range`, `/api/explore/genre-emergence`, `before_year` parameter on `/api/expand`)
- Collection timeline evolution (`/api/user/collection/timeline`, `/api/user/collection/evolution`)
- Recommendation engine (`/api/recommend/similar/artist/{artist_id}`: find similar artists via shared genres/styles; `/api/recommend/explore/{entity_type}/{entity_id}`: explore-from-here discovery)
- Collaborator network (`/api/collaborators/{artist_id}`: artists sharing releases, with temporal breakdown)
- Collaboration network analysis (`/api/network/artist/{id}/collaborators`: multi-hop collaborator traversal; `/api/network/artist/{id}/centrality`: degree and collaboration centrality; `/api/network/cluster/{id}`: community detection via genre clustering)
- Genre tree hierarchy (`/api/genre-tree`: genre/style tree derived from release co-occurrence)
- Graph statistics (`/api/graph/stats`: aggregate node counts across all entity types)
- MusicBrainz enrichment status (`/api/enrichment/status`: coverage statistics for MB-enriched entities)
- MusicBrainz artist metadata (`/api/musicbrainz/artist/{artist_id}`: MB properties for a Discogs artist)
- MusicBrainz artist relationships (`/api/musicbrainz/artist/{artist_id}/relationships`: MB-sourced edges)
- MusicBrainz artist external links (`/api/musicbrainz/artist/{artist_id}/external-links`: Wikipedia, Wikidata, etc.)
MCP Server (AI assistant integration):
- Thin HTTP client that proxies all 11 tools through the API service, with no direct database access
- Tools: search, entity details (artist/label/release/genre/style), path finder, trends, graph stats, collaborators, genre tree
- Transports: stdio (Claude Desktop, Cursor, Zed) or streamable-http (hosted)
Insights Service (precomputed analytics):
- Scheduled batch analytics fetched from API internal endpoints over HTTP (configurable interval, default: 24h)
- Top artists by graph centrality (`/api/insights/top-artists`)
- Genre trends by decade (`/api/insights/genre-trends`)
- Label longevity rankings (`/api/insights/label-longevity`)
- Monthly release anniversaries (`/api/insights/this-month`)
- Data completeness scores (`/api/insights/data-completeness`)
- Computation status monitoring (`/api/insights/status`)
Explore Service (static frontend):
- Serves the D3.js force-directed graph UI and Plotly.js trends frontend
- All graph query API calls are made from the browser to the API service (port 8004)
Dashboard Service:
- Real-time WebSocket updates
- System health monitoring
- Queue metrics and processing rates
- Interactive visualizations
Responsibilities:
- Discogs mode (`--source discogs`): Download XML dumps from S3, parse, deduplicate, publish to 4 fanout exchanges
- MusicBrainz mode (`--source musicbrainz`): Parse JSONL dumps, extract Discogs IDs, publish to 4 fanout exchanges
- Validate checksums and metadata
- Track progress via version-specific state markers
Key Features:
- Async Rust with Tokio runtime
- 20,000-400,000+ records/sec processing (Discogs XML)
- Memory-efficient streaming parsers for both XML and JSONL
- Periodic update checks (configurable interval)
- Smart file completion tracking
- Automatic retry with exponential backoff
- Separate state markers per source: `.extraction_status_*.json` (Discogs) and `.mb_extraction_status_*.json` (MusicBrainz)
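The "automatic retry with exponential backoff" feature follows a common pattern, sketched here in Python for readability (the extractor itself is Rust). The base delay, cap, and attempt count are assumed values, and `retry` is a hypothetical helper rather than the extractor's actual API.

```python
import time
from typing import Callable, Iterator


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0) -> Iterator[float]:
    """Yield the wait before each retry: base * 2**n, capped at `cap`."""
    for n in range(attempts):
        yield min(cap, base * (2 ** n))


def retry(operation: Callable, attempts: int = 5, sleep: Callable = time.sleep,
          base: float = 1.0, cap: float = 60.0):
    """Run `operation`, sleeping with exponential backoff between failures."""
    last_error = None
    for delay in backoff_delays(attempts, base, cap):
        try:
            return operation()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            sleep(delay)
    raise last_error
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Production implementations often add random jitter so that many clients do not retry in lockstep.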
Configuration:
- `DISCOGS_ROOT` / `MUSICBRAINZ_ROOT`: Data storage directories
- `PERIODIC_CHECK_DAYS`: Update check interval
- `RABBITMQ_HOST`: RabbitMQ hostname
- `RABBITMQ_USERNAME`, `RABBITMQ_PASSWORD`: RabbitMQ auth credentials
- `DISCOGS_EXCHANGE_PREFIX`: Exchange name prefix (default: `discogsography-discogs` for Discogs, `discogsography-musicbrainz` for MB)
See Extractor README for details.
Responsibilities:
- Create all Neo4j constraints and indexes on first run
- Create all PostgreSQL tables and indexes on first run
- Run as a one-shot init container before any other service starts
- All DDL uses `IF NOT EXISTS`, so it is safe to re-run and never drops schema objects
Key Features:
- Idempotent: re-running on an already-initialized database is a no-op
- Single source of truth for both Neo4j and PostgreSQL schema definitions
- Schema definitions live in `schema-init/neo4j_schema.py` and `schema-init/postgres_schema.py`
- Parallel initialization: Neo4j and PostgreSQL schema creation run concurrently
- Exits 0 on success, 1 on any failure (so dependent services will not start)
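The concurrency and exit-code behaviour described above can be sketched with `asyncio.gather`. The two initializer coroutines are placeholders (the real ones would run DDL against Neo4j and PostgreSQL), and `run_schema_init` is a hypothetical name.

```python
import asyncio
from typing import Awaitable, Callable


async def init_neo4j() -> None:
    """Placeholder for creating Neo4j constraints and indexes."""
    await asyncio.sleep(0)


async def init_postgres() -> None:
    """Placeholder for creating PostgreSQL tables and indexes."""
    await asyncio.sleep(0)


async def run_schema_init(*initializers: Callable[[], Awaitable[None]]) -> int:
    """Run all initializers concurrently; return a process exit code."""
    results = await asyncio.gather(
        *(init() for init in initializers), return_exceptions=True
    )
    failures = [r for r in results if isinstance(r, BaseException)]
    for failure in failures:
        print(f"schema init failed: {failure!r}")
    return 1 if failures else 0


# Entry point of a one-shot container could then be:
#   raise SystemExit(asyncio.run(run_schema_init(init_neo4j, init_postgres)))
```

Using `return_exceptions=True` lets both databases finish their attempt before the process decides on the exit code, so a failure in one does not hide a failure in the other.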
Configuration:
- `NEO4J_HOST`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`: Neo4j connection
- `POSTGRES_HOST`, `POSTGRES_USERNAME`, `POSTGRES_PASSWORD`, `POSTGRES_DATABASE`: PostgreSQL connection
Responsibilities:
- Build Neo4j knowledge graph
- Create nodes and relationships
- Maintain graph indexes
- Handle schema evolution
- Pre-compute aggregate statistics on Genre/Style nodes after release import
Key Features:
- Automatic relationship detection
- Batch transaction processing
- Connection resilience with retry logic
- Smart consumer lifecycle management
- Post-import aggregation: `compute_genre_style_stats()` sets 5 pre-computed properties (release_count, artist_count, label_count, style_count/genre_count, first_year) on each Genre and Style node using `CALL {} IN TRANSACTIONS OF 1 ROWS`
Configuration:
- `NEO4J_HOST`: Neo4j bolt URL
- `NEO4J_USERNAME`, `NEO4J_PASSWORD`: Auth credentials
- `CONSUMER_CANCEL_DELAY`: Idle timeout before shutdown
See Graphinator README for details.
Responsibilities:
- Store data in PostgreSQL
- Create and maintain indexes
- Handle JSONB documents
- Enable full-text search
Key Features:
- JSONB for flexible schema
- GIN indexes for fast queries
- Batch insert optimization
- Connection pool management
Configuration:
- `POSTGRES_HOST`: PostgreSQL host:port
- `POSTGRES_USERNAME`, `POSTGRES_PASSWORD`: Auth credentials
- `POSTGRES_DATABASE`: Database name
See Tableinator README for details.
Responsibilities:
- Enrich existing Neo4j nodes with MusicBrainz metadata
- Create new relationship edges between Discogs-matched entities
- Track enrichment statistics (entities enriched, skipped, relationships created)
Key Features:
- Enriches Artist, Label, Release, and Master nodes with `mb_`-prefixed properties (mbid, type, gender, dates, area, disambiguation, secondary_types, first_release_date)
- Creates 8 relationship edge types: MEMBER_OF (enriched), COLLABORATED_WITH, TAUGHT, TRIBUTE_TO, FOUNDED, SUPPORTED, SUBGROUP_OF, RENAMED_TO
- All MB-sourced edges carry `source: 'musicbrainz'` provenance
- Discogs-matched entities only: skips entities without a Discogs ID in the MB data
- Both sides required for edges: relationships are only created when both entities exist in Neo4j
- Smart connection lifecycle: auto-close when idle, periodic queue checks, auto-reconnect
- Idempotent writes: safe for re-import
Configuration:
- `NEO4J_HOST`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`: Neo4j connection
- `RABBITMQ_HOST`, `RABBITMQ_USERNAME`, `RABBITMQ_PASSWORD`: RabbitMQ connection
- `CONSUMER_CANCEL_DELAY`: Idle timeout before consumer cancellation (default: 300s)
See Brainzgraphinator README for details.
Responsibilities:
- Store all MusicBrainz data in the PostgreSQL `musicbrainz` schema
- Record entity relationships and external links
- Maintain data integrity with MBID-based primary keys
Key Features:
- Stores artists, labels, release-groups, and releases with structured columns plus a JSONB `data` column for the full record
- Records relationships (collaborations, band membership, etc.) with source/target MBIDs
- Stores external links (Wikipedia, Wikidata, AllMusic, Last.fm, IMDb) per entity
- Stores all entities, including those without Discogs matches (available for future use)
- `ON CONFLICT DO UPDATE/NOTHING` for idempotent processing
- Smart connection lifecycle: auto-close when idle, periodic queue checks, auto-reconnect
Configuration:
- `POSTGRES_HOST`, `POSTGRES_USERNAME`, `POSTGRES_PASSWORD`, `POSTGRES_DATABASE`: PostgreSQL connection
- `RABBITMQ_HOST`, `RABBITMQ_USERNAME`, `RABBITMQ_PASSWORD`: RabbitMQ connection
- `CONSUMER_CANCEL_DELAY`: Idle timeout before consumer cancellation (default: 300s)
See Brainztableinator README for details.
Responsibilities:
- Serve the interactive graph exploration frontend (Tailwind CSS, Alpine.js, D3.js, Plotly.js)
- Provide a health check endpoint
- All graph query API endpoints are routed through the API service
Key Features:
- FastAPI static file serving (HTML, JS, CSS)
- Tailwind CSS dark theme with Alpine.js reactive UI
- D3.js force-directed graph and Plotly.js trends visualizations
- Internal-only (not externally exposed in Docker Compose)
Configuration:
- `API_BASE_URL`: URL of the API service to proxy graph query requests (default: `http://api:8004`)
- `CORS_ORIGINS`: Optional comma-separated list of allowed CORS origins
See Explore README for details.
Responsibilities:
- Real-time system monitoring
- WebSocket-based live updates
- Service health checks
- Queue metrics visualization
- Admin panel (login-gated) for extraction management and DLQ operations
Key Features:
- FastAPI backend
- WebSocket for real-time data
- Responsive HTML/CSS/JS frontend
- Activity log and event tracking
- Admin proxy router: forwards authenticated admin requests to the API service
- Extraction trigger (forces full reprocessing) and history table
- Dead-letter queue purge interface
Configuration:
- Service health endpoint URLs
- Database connection strings
- RabbitMQ management API access
- `API_HOST` / `API_PORT`: API service connection for admin proxy
See Dashboard README for details.
Responsibilities:
- Run scheduled batch analytics by fetching raw query data from the API service over HTTP
- Compute artist centrality, genre trends, label longevity, anniversaries, and data completeness
- Store precomputed results in PostgreSQL `insights.*` tables
- Serve analytics via read-only HTTP endpoints
Key Features:
- FastAPI backend with async PostgreSQL and httpx (API client)
- Configurable scheduler interval (default: 24 hours)
- 5 computation types running sequentially
- Redis caching with cache-aside pattern (TTL matches schedule interval, invalidated after computation)
- Separate health server on port 8009
- Results proxied through the API service at `/api/insights/*`
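The cache-aside pattern mentioned in the feature list (read the cache first, fall back to the source, invalidate after each computation) can be sketched as follows. An in-memory `TTLCache` stands in for Redis here, and the key name and loader function are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a Redis-style cache with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._items: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._items[key] = (value, self._clock() + ttl_seconds)

    def delete(self, key: str) -> None:
        self._items.pop(key, None)


def cached(cache: TTLCache, key: str, ttl_seconds: float, load: Callable[[], Any]) -> Any:
    """Cache-aside read: return the cached value, or load and store it."""
    value = cache.get(key)
    if value is None:
        value = load()
        cache.set(key, value, ttl_seconds)
    return value
```

After a scheduled computation finishes, calling `cache.delete(key)` for the affected keys forces the next read to pick up the fresh results instead of waiting for the TTL to lapse.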
Configuration:
- `API_BASE_URL`: URL of the API service for fetching raw query data over HTTP
- `POSTGRES_HOST`, `POSTGRES_USERNAME`, `POSTGRES_PASSWORD`, `POSTGRES_DATABASE`: PostgreSQL connection
- `INSIGHTS_SCHEDULE_HOURS`: Computation interval in hours (default: 24)
- `REDIS_HOST`: Redis hostname for result caching
- `INSIGHTS_MILESTONE_YEARS`: Configurable anniversary years to highlight
See Insights README for details.
Responsibilities:
- User registration and authentication (`/api/auth/*`)
- Self-service password reset with Redis-backed tokens (`/api/auth/reset-*`)
- Optional TOTP two-factor authentication (`/api/auth/2fa/*`)
- JWT token generation and validation (HS256)
- Discogs OAuth 1.0a OOB flow management (`/api/oauth/*`)
- Discogs OAuth token storage and retrieval
- Graph query endpoints (`/api/autocomplete`, `/api/explore`, `/api/expand`, `/api/node/{id}`, `/api/trends`)
- User collection and wantlist queries (`/api/user/collection`, `/api/user/wantlist`, `/api/user/recommendations`, `/api/user/collection/stats`, `/api/user/status`)
- Collection gap analysis (`/api/collection/gaps/{type}/{id}`, `/api/collection/formats`)
- Collection and wantlist sync (`/api/sync`, `/api/sync/status`)
- Graph snapshot save/restore (`/api/snapshot`, `/api/snapshot/{token}`)
- Recommendation endpoints (`/api/recommend/similar/artist/{artist_id}`, `/api/recommend/explore/{entity_type}/{entity_id}`)
- Label DNA endpoints (`/api/label/{id}/dna`, `/api/label/{id}/similar`, `/api/label/dna/compare`)
- Taste fingerprint endpoints (`/api/user/taste/*`)
- Unified full-text search (`/api/search`)
- Collection timeline and evolution (`/api/user/collection/timeline`, `/api/user/collection/evolution`)
- Vinyl Archaeology endpoints (`/api/explore/year-range`, `/api/explore/genre-emergence`)
- Path finder (`/api/path`)
- Reads Discogs app credentials from the `app_config` table (set via the `discogs-setup` CLI)
Key Features:
- FastAPI backend with async PostgreSQL
- PBKDF2-SHA256 password hashing (100,000 iterations)
- Stateless JWT authentication using shared `JWT_SECRET_KEY`
- Redis-backed OAuth state storage with TTL
- Token-protected endpoints for all user operations
- Self-service password reset (Redis tokens, 15min TTL, anti-enumeration)
- Optional TOTP 2FA with pyotp (QR code setup, recovery codes, brute-force lockout)
- HKDF-SHA256 key derivation for per-purpose encryption (OAuth tokens, TOTP secrets)
- Brevo transactional email integration (optional; falls back to log output)
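As a sketch of the password hashing listed above, the standard library's `hashlib.pbkdf2_hmac` covers PBKDF2-SHA256 with 100,000 iterations. The salt length, the storage format, and the function names are assumptions for this example; the API service's actual implementation may differ.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 100_000  # iteration count stated in the feature list
SALT_BYTES = 16       # assumption: the salt length is not given in this document


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, derived_key) using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(SALT_BYTES)
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, key


def verify_password(password: str, salt: bytes, expected_key: bytes) -> bool:
    """Recompute the key and compare in constant time."""
    _, key = hash_password(password, salt)
    return hmac.compare_digest(key, expected_key)
```

`hmac.compare_digest` avoids leaking how many leading bytes matched, which a plain `==` comparison could do through timing.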
Configuration:
- `JWT_SECRET_KEY`: Shared secret for HS256 token signing
- `POSTGRES_HOST`, `POSTGRES_USERNAME`, `POSTGRES_PASSWORD`: PostgreSQL connection
- `REDIS_HOST`: Redis connection for OAuth state
- `DISCOGS_USER_AGENT`: User-Agent header for Discogs API calls
See API README for details.
graph LR
subgraph Producers
EXT[Extractor<br/>--source discogs]
end
subgraph RabbitMQ
subgraph Fanout Exchanges
AX[discogsography-discogs-artists]
LX[discogsography-discogs-labels]
RX[discogsography-discogs-releases]
MX[discogsography-discogs-masters]
end
subgraph Graphinator Queues
GAQ[graphinator-artists]
GLQ[graphinator-labels]
GRQ[graphinator-releases]
GMQ[graphinator-masters]
end
subgraph Tableinator Queues
TAQ[tableinator-artists]
TLQ[tableinator-labels]
TRQ[tableinator-releases]
TMQ[tableinator-masters]
end
end
subgraph Consumers
GRAPH[Graphinator]
TABLE[Tableinator]
end
EXT --> AX & LX & RX & MX
AX --> GAQ & TAQ
LX --> GLQ & TLQ
RX --> GRQ & TRQ
MX --> GMQ & TMQ
GAQ & GLQ & GRQ & GMQ --> GRAPH
TAQ & TLQ & TRQ & TMQ --> TABLE
style EXT fill:#ffccbc,stroke:#d84315
style GRAPH fill:#f3e5f5,stroke:#4a148c
style TABLE fill:#e8f5e9,stroke:#1b5e20
graph LR
subgraph Producers
EXT_MB[Extractor<br/>--source musicbrainz]
end
subgraph RabbitMQ
subgraph MB Fanout Exchanges
MAQ[discogsography-musicbrainz-artists]
MLQ[discogsography-musicbrainz-labels]
MRGQ[discogsography-musicbrainz-release-groups]
MRQ[discogsography-musicbrainz-releases]
end
subgraph Brainzgraphinator Queues
BGA[brainzgraphinator-artists]
BGL[brainzgraphinator-labels]
BGRG[brainzgraphinator-release-groups]
BGR[brainzgraphinator-releases]
end
subgraph Brainztableinator Queues
BTA[brainztableinator-artists]
BTL[brainztableinator-labels]
BTRG[brainztableinator-release-groups]
BTR[brainztableinator-releases]
end
end
subgraph Consumers
BGRAPH[Brainzgraphinator]
BTABLE[Brainztableinator]
end
EXT_MB --> MAQ & MLQ & MRGQ & MRQ
MAQ --> BGA & BTA
MLQ --> BGL & BTL
MRGQ --> BGRG & BTRG
MRQ --> BGR & BTR
BGA & BGL & BGRG & BGR --> BGRAPH
BTA & BTL & BTRG & BTR --> BTABLE
style EXT_MB fill:#ffccbc,stroke:#d84315
style BGRAPH fill:#f3e5f5,stroke:#4a148c
style BTABLE fill:#e8f5e9,stroke:#1b5e20
- Durability: All queues are durable (survive broker restart)
- Persistence: Messages persisted to disk
- Prefetch: Configurable per consumer (default: 100)
- Dead Letter: Failed messages routed to DLX
- TTL: No message expiration (process all data)
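The queue settings listed above map onto standard AMQP options. The sketch below collects them in a plain dictionary whose keys mirror the keyword arguments of pika's `channel.queue_declare` and `channel.basic_qos`; the dead-letter exchange name passed in is hypothetical, and the helper is illustrative rather than taken from the services.

```python
def queue_settings(queue: str, dead_letter_exchange: str, prefetch: int = 100) -> dict:
    """Describe how one consumer queue would be declared and consumed."""
    return {
        # Durable queue that survives a broker restart; failed messages
        # are routed to the dead-letter exchange.
        "queue_declare": {
            "queue": queue,
            "durable": True,
            "arguments": {"x-dead-letter-exchange": dead_letter_exchange},
        },
        # Upper bound on unacknowledged messages per consumer.
        "basic_qos": {"prefetch_count": prefetch},
        # AMQP delivery_mode 2 marks a message as persistent.
        "publish_properties": {"delivery_mode": 2},
    }
```

No `x-message-ttl` argument is set, which matches the "no message expiration" policy: messages stay queued until a consumer processes them.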
- Active Processing: Consuming and processing messages
- Idle Detection: All queues empty, no messages for 5 minutes
- Connection Cleanup: Close RabbitMQ connections
- Periodic Checking: Check queues every hour for new messages
- Auto-Reconnection: Restart consumers when new data arrives
See Consumer Cancellation for details.
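The lifecycle above behaves like a small state machine. This sketch models it with explicit timestamps so it can be exercised without a broker; the class, state names, and returned action strings are hypothetical, while the 5-minute idle timeout and hourly check come from the list above.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    DISCONNECTED = "disconnected"


IDLE_TIMEOUT = 300     # seconds without messages before closing (5 minutes)
CHECK_INTERVAL = 3600  # seconds between queue checks while disconnected


class ConsumerLifecycle:
    def __init__(self, now: float = 0.0):
        self.state = State.ACTIVE
        self.last_message_at = now
        self.last_check_at = now

    def on_message(self, now: float) -> None:
        """Record activity so the idle timer restarts."""
        self.last_message_at = now

    def tick(self, now: float, queue_depth: int) -> str:
        """Advance the state machine and return the action to take."""
        if self.state is State.ACTIVE:
            idle_for = now - self.last_message_at
            if queue_depth == 0 and idle_for >= IDLE_TIMEOUT:
                self.state = State.DISCONNECTED
                self.last_check_at = now
                return "close_connection"
            return "consume"
        if now - self.last_check_at >= CHECK_INTERVAL:
            self.last_check_at = now
            if queue_depth > 0:
                self.state = State.ACTIVE
                self.last_message_at = now
                return "reconnect"
            return "checked_empty"
        return "sleep"
```

Closing idle connections keeps long-lived consumers from holding broker resources between the monthly and twice-weekly dumps, while the periodic check bounds how long new data can wait.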
Purpose: Store and query complex music relationships
Node Types:
- Artist (musicians, bands, producers)
- Label (record labels, imprints)
- Master (master recordings)
- Release (physical/digital releases)
- Genre (musical genres)
- Style (sub-genres, styles)
- User (authenticated Discogs users)
Relationship Types:
- BY (release → artist)
- ON (release → label)
- DERIVED_FROM (release → master)
- IS (release → genre/style)
- MEMBER_OF (artist → band)
- ALIAS_OF (artist alias → primary artist)
- SUBLABEL_OF (label → parent label)
- PART_OF (style → genre)
- COLLECTED (user → release)
- WANTS (user → release)
MusicBrainz-sourced Relationships (all carry source: 'musicbrainz'):
- COLLABORATED_WITH (artist → artist)
- TAUGHT (teacher → student)
- TRIBUTE_TO (tribute act → original)
- FOUNDED (person → group)
- SUPPORTED (supporter → main artist)
- SUBGROUP_OF (subgroup → parent)
- RENAMED_TO (old → new)
See Database Schema for details.
Purpose: Fast structured queries and analytics
Tables:
- `artists`: Artist data in JSONB format
- `labels`: Label data in JSONB format
- `masters`: Master recording data
- `releases`: Release data with full-text indexes
- `insights.artist_centrality`: Top artists by graph centrality
- `insights.genre_trends`: Genre release counts by decade
- `insights.label_longevity`: Labels ranked by years active
- `insights.monthly_anniversaries`: Notable release anniversaries
- `insights.data_completeness`: Data quality metrics per entity type
- `insights.computation_log`: Audit log of computation runs
MusicBrainz Schema (musicbrainz.*):
- `musicbrainz.artists`: MBID, name, type, gender, dates, area, Discogs cross-reference
- `musicbrainz.labels`: MBID, name, type, label code, dates, Discogs cross-reference
- `musicbrainz.releases`: MBID, name, barcode, status, Discogs cross-reference
- `musicbrainz.relationships`: Source/target MBIDs, relationship type, direction, attributes
- `musicbrainz.external_links`: MBID, service name, URL (Wikipedia, Wikidata, AllMusic, etc.)
Indexes:
- B-tree indexes on common query fields
- GIN indexes on JSONB columns
- Full-text search indexes
- Filtered indexes on Discogs ID columns for MusicBrainz cross-reference lookups
See Database Schema for details.
Purpose: Cache query results and OAuth state
Cache Types:
- OAuth state tokens (API: short TTL, used during Discogs OAuth flow)
- Graph snapshots (API: native Redis TTL, default 28 days, survives service restarts)
- JWT revocation blacklist (API: JTI claims with TTL matching token expiry)
- Insights computation results (Insights: TTL matches schedule interval, invalidated after each run)
- Query result caching (Dashboard)
- Dashboard metrics
- API query result caching (cache-aside pattern):
| Cache Key Pattern | TTL | Endpoints Covered |
|---|---|---|
| `trends:{type}:{name}` | 24h | `/api/trends?type=genre\|style` |
| `label-dna:{label_id}` | 24h | `/api/label/{id}/dna` |
| `label-similar:{label_id}:{limit}` | 24h | `/api/label/{id}/similar` |
| `recommend:similar:artist:{artist_id}` | 24h | `/api/recommend/similar/artist/{artist_id}` |
| `explore-artist:{name}` | 24h | `/api/explore?type=artist` |
| `trends-label:{label_id}` | 24h | `/api/trends?type=label` |
| `search:{md5_digest}` | 5m | `/api/search` |
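A sketch of how keys following the patterns in the table could be built. The table only gives the key shapes and TTLs; what exactly is fed into the MD5 digest for search keys is not stated in this document, so hashing a canonical JSON form of the query parameters is an assumption, as are the function names.

```python
import hashlib
import json

TTL_24H = 24 * 60 * 60  # seconds
TTL_5M = 5 * 60         # seconds


def label_dna_key(label_id: str) -> str:
    return f"label-dna:{label_id}"


def label_similar_key(label_id: str, limit: int) -> str:
    return f"label-similar:{label_id}:{limit}"


def search_key(params: dict) -> str:
    """Derive a fixed-length key from arbitrary search parameters."""
    # Assumption: sort keys so equivalent queries share one cache entry.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return "search:" + hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

Hashing keeps search keys short and free of special characters regardless of query length; MD5 is used here purely as a fingerprint, not for security.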
Configuration:
- Default TTL: varies by cache type (see table above)
- Max memory: Configurable
- Eviction policy: LRU
- Non-root users (UID 1000)
- Read-only root filesystems
- Dropped capabilities
- No new privileges flag
- Resource limits (CPU, memory)
See Docker Security for details.
- No external ports exposed (except dashboards)
- Internal Docker network for services
- Encrypted connections to databases
- Secrets via environment variables
- Bandit security scanning
- Dependency vulnerability checks
- Type safety with mypy
- Input validation at boundaries
All services expose HTTP health endpoints:
# Externally accessible (Docker Compose)
curl http://localhost:8003/health # Dashboard
curl http://localhost:8005/health # API health check port
# Internal only (available from within Docker network, or local dev)
curl http://localhost:8000/health # Extractor
curl http://localhost:8001/health # Graphinator
curl http://localhost:8002/health # Tableinator
curl http://localhost:8007/health # Explore
curl http://localhost:8009/health # Insights
curl http://localhost:8010/health # Brainztableinator
curl http://localhost:8011/health # Brainzgraphinator
- Structured logging with emojis
- Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
- Service-specific log files
- Centralized log aggregation ready
See Logging Guide for details.
- Processing rates (records/second)
- Queue depths and consumer counts
- Database connection pool stats
- Memory and CPU usage
- Error rates and retry counts
See Monitoring for details.
| Data Type | Record Count | XML Size | Initial Load | Update Run |
|---|---|---|---|---|
| Releases | ~19 million | ~11GB | ~40 hours | ~26 hours |
| Artists | ~10 million | ~461MB | ~21 hours | ~14 hours |
| Masters | ~2.5 million | ~575MB | ~4.5 hours | ~4 hours |
| Labels | ~2.3 million | ~84MB | ~4 hours | ~3 hours |
Total: ~34 million records • ~11.3GB compressed • ~76GB on disk (28GB Neo4j + 48GB PostgreSQL)
Initial load: ~2 days (parallel, limited by releases) • Update run: ~26 hours (roughly 1.5x faster than the initial load, per the releases row above)
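The table figures can be turned into effective end-to-end throughput with a few lines of arithmetic. The numbers below are copied from the table (and are approximate, as the table marks them); the helper names are hypothetical.

```python
# (record count, initial load hours, update run hours) per data type
LOADS = {
    "releases": (19_000_000, 40, 26),
    "artists": (10_000_000, 21, 14),
    "masters": (2_500_000, 4.5, 4),
    "labels": (2_300_000, 4, 3),
}


def records_per_second(records: int, hours: float) -> float:
    """Effective end-to-end rate, including consumer and database time."""
    return records / (hours * 3600)


def wall_clock_hours(column: int) -> float:
    """Data types load in parallel, so the slowest one sets the total."""
    return max(row[column] for row in LOADS.values())


for name, (records, initial, update) in LOADS.items():
    rate = records_per_second(records, initial)
    print(f"{name}: ~{rate:.0f} records/sec during the initial load")
```

These end-to-end rates are far below the extractor's parsing rate quoted earlier, which suggests the database writes rather than XML parsing dominate the load time.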
Nodes: ~33.8 million
| Node Label | Count |
|---|---|
| Release | ~19 million |
| Artist | ~10 million |
| Master | ~2.5 million |
| Label | ~2.4 million |
| Style | 757 |
| Genre | 16 |
Relationships: ~134.3 million
| Relationship Type | Count | Description |
|---|---|---|
| IS | ~61.2 million | Release/Master → Style/Genre |
| BY | ~26 million | Release/Master → Artist |
| ON | ~20.6 million | Release → Label |
| DERIVED_FROM | ~19 million | Release → Master |
| ALIAS_OF | ~4.9 million | Artist → Artist (aliases) |
| MEMBER_OF | ~2.3 million | Artist → Artist (group membership) |
| SUBLABEL_OF | ~278K | Label → Label (parent/child) |
| PART_OF | ~10K | Style → Genre membership |
Stateless Services (can scale horizontally):
- API (load balanced; JWT validation is stateless)
- Extractor (one instance per source: Discogs and MusicBrainz)
- Graphinator (multiple consumers per queue)
- Tableinator (multiple consumers per queue)
- Brainzgraphinator (multiple consumers per queue)
- Brainztableinator (multiple consumers per queue)
- Explore (load balanced)
- Dashboard (load balanced)
- Insights (load balanced)
Stateful Services (scale vertically):
- Neo4j (clustering available in enterprise)
- PostgreSQL (replication supported)
- RabbitMQ (clustering supported)
- Redis (clustering supported)
- Batch size optimization
- Prefetch count tuning
- Connection pool sizing
- Index optimization
- Query caching strategies (Redis cache-aside pattern)
- Pre-computed aggregate properties on graph nodes
- Neo4j Cypher query plan optimization (CALL {} barriers, pattern comprehension)
See Performance Guide for general strategies and Query Performance Optimizations for the detailed Cypher optimization report (249x overall improvement).
docker-compose up -d
Pros:
- Easy setup
- All services on one machine
- Good for development and testing
Cons:
- Limited scalability
- Single point of failure
Recommended for:
- Production deployments
- High availability requirements
- Auto-scaling needs
- Multi-node clusters
Components:
- Deployments for stateless services
- StatefulSets for databases
- Services for load balancing
- ConfigMaps and Secrets
- Persistent volumes
- Quick Start Guide - Get started quickly
- Configuration Guide - Environment variables and settings
- Database Schema - Detailed schema documentation
- Performance Guide - Optimization strategies
- Monitoring Guide - Observability and debugging
Last Updated: 2026-04-03