🎵 Discogsography

Build Code Quality Tests E2E Tests License: MIT Python 3.13+ Rust uv just Ruff Cargo Clippy pre-commit mypy Bandit Docker

A modern Python 3.13+ microservices platform for transforming the complete Discogs music database into powerful, queryable knowledge graphs and analytics engines.

🚀 Quick Start | 📖 Documentation | 🎯 Features | 💬 Community | 📋 Emoji Guide

🎯 What is Discogsography?

Discogsography transforms monthly Discogs data dumps (50GB+ compressed XML) into:

  • 🔗 Neo4j Graph Database: Navigate complex music industry relationships
  • 🐘 PostgreSQL Database: High-performance queries and full-text search
  • 🤖 AI Discovery Engine: Intelligent recommendations and analytics
  • 📊 Real-time Dashboard: Monitor system health and processing metrics

Perfect for music researchers, data scientists, developers, and music enthusiasts who want to explore the world's largest music database.

🏛️ Architecture Overview

⚙️ Core Services

Service Purpose Key Technologies
📥 Python Extractor Downloads & processes Discogs XML dumps (Python) asyncio, orjson, aio-pika
⚡ Rust Extractor High-performance Rust-based extractor tokio, quick-xml, lapin
🔗 Graphinator Builds Neo4j knowledge graphs neo4j-driver, graph algorithms
🐘 Tableinator Creates PostgreSQL analytics tables psycopg3, JSONB, full-text search
🎵 Discovery AI-powered music intelligence sentence-transformers, plotly, networkx
📊 Dashboard Real-time system monitoring FastAPI, WebSocket, reactive UI

πŸ“ System Architecture

graph TD
    S3[("🌐 Discogs S3<br/>Monthly Data Dumps<br/>~50GB XML")]
    PYEXT[["πŸ“₯ Python Extractor<br/>XML β†’ JSON<br/>Deduplication"]]
    RSEXT[["⚑ Rust Extractor<br/>High-Performance<br/>XML Processing"]]
    RMQ{{"🐰 RabbitMQ<br/>Message Broker<br/>4 Queues"}}
    NEO4J[("πŸ”— Neo4j<br/>Graph Database<br/>Relationships")]
    PG[("🐘 PostgreSQL<br/>Analytics DB<br/>Full-text Search")]
    GRAPH[["πŸ”— Graphinator<br/>Graph Builder"]]
    TABLE[["🐘 Tableinator<br/>Table Builder"]]
    DASH[["πŸ“Š Dashboard<br/>Real-time Monitor<br/>WebSocket"]]
    DISCO[["🎡 Discovery<br/>AI Engine<br/>ML Models"]]

    S3 -->|1a. Download & Parse| PYEXT
    S3 -->|1b. Download & Parse| RSEXT
    PYEXT -->|2. Publish Messages| RMQ
    RSEXT -->|2. Publish Messages| RMQ
    RMQ -->|3a. Artists/Labels/Releases/Masters| GRAPH
    RMQ -->|3b. Artists/Labels/Releases/Masters| TABLE
    GRAPH -->|4a. Build Graph| NEO4J
    TABLE -->|4b. Store Data| PG

    DISCO -.->|Query| NEO4J
    DISCO -.->|Query| PG
    DISCO -.->|Analyze| DISCO

    DASH -.->|Monitor| PYEXT
    DASH -.->|Monitor| RSEXT
    DASH -.->|Monitor| GRAPH
    DASH -.->|Monitor| TABLE
    DASH -.->|Monitor| DISCO
    DASH -.->|Stats| RMQ
    DASH -.->|Stats| NEO4J
    DASH -.->|Stats| PG

    style S3 fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style PYEXT fill:#fff9c4,stroke:#f57c00,stroke-width:2px
    style RSEXT fill:#ffccbc,stroke:#d84315,stroke-width:2px
    style RMQ fill:#fff3e0,stroke:#e65100,stroke-width:2px
    style NEO4J fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    style PG fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    style DASH fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    style DISCO fill:#e3f2fd,stroke:#0d47a1,stroke-width:2px

🌟 Key Features

🚀 Performance & Scale

  • ⚡ High-Speed Processing: 5,000-10,000 records/second XML parsing
  • 🔄 Smart Deduplication: SHA256 hash-based change detection prevents reprocessing
  • 📈 Handles Big Data: Processes 15M+ releases, 2M+ artists efficiently
  • 🎯 Concurrent Processing: Multi-threaded parsing with async message handling
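The hash-based change detection above can be sketched in a few lines. This is an illustrative sketch, not the project's actual code: `record_hash`, `record_changed`, and the in-memory cache are hypothetical names (the real services persist hashes, e.g. in PostgreSQL's `hash` column).

```python
import hashlib
import json

# Cache of record-id -> SHA256 of the last version we processed (illustrative).
seen_hashes: dict[str, str] = {}


def record_hash(record: dict) -> str:
    # Canonical JSON (sorted keys, no whitespace) so logically equal records hash equally.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_changed(record_id: str, record: dict) -> bool:
    """Return True (and update the cache) only when the record is new or changed."""
    h = record_hash(record)
    if seen_hashes.get(record_id) == h:
        return False  # unchanged -> skip reprocessing
    seen_hashes[record_id] = h
    return True
```

On a monthly re-ingest, most records hash to the same value and are skipped, which is what keeps reprocessing cheap.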

πŸ›‘οΈ Reliability & Operations

  • πŸ” Auto-Recovery: Automatic retries with exponential backoff
  • πŸ’Ύ Message Durability: RabbitMQ persistence with dead letter queues
  • πŸ₯ Health Monitoring: HTTP health checks for all services
  • πŸ“Š Real-time Metrics: WebSocket dashboard with live updates
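Exponential backoff with jitter, as mentioned above, follows a standard pattern; this is a generic sketch (not the services' actual retry code):

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Call operation(), retrying failures with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herds
```

The jitter matters when many consumers lose a broker connection at once: without it they would all reconnect in lockstep.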

🔒 Security & Quality

  • 🐋 Container Security: Non-root users, read-only filesystems, dropped capabilities
  • 🔐 Code Security: Bandit scanning, secure defaults, parameterized queries
  • 📝 Type Safety: Full type hints with strict mypy validation
  • ✅ Comprehensive Testing: Unit, integration, and E2E tests with Playwright

🤖 AI & Analytics

  • 🧠 ML-Powered Discovery: Semantic search using sentence transformers
  • 📊 Industry Analytics: Genre trends, label insights, market analysis
  • 🔍 Graph Algorithms: PageRank, community detection, path finding
  • 🎨 Interactive Visualizations: Plotly charts, vis.js network graphs
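As a toy illustration of the graph-algorithm side (the Discovery service uses networkx and Neo4j; this dependency-free power-iteration PageRank is only a sketch of the idea):

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of (src, dst) edges."""
    nodes = {n for e in edges for n in e}
    out_links = {n: [d for s, d in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
            else:  # dangling node: spread its mass evenly
                for n in nodes:
                    new[n] += damping * rank[src] / len(nodes)
        rank = new
    return rank


# Hypothetical mini collaboration graph, just to exercise the function.
ranks = pagerank([
    ("Miles Davis", "John Coltrane"),
    ("Bill Evans", "John Coltrane"),
    ("John Coltrane", "Miles Davis"),
])
```

Nodes that many edges point at (here "John Coltrane") accumulate rank, which is the intuition behind using PageRank to surface influential artists or labels.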

📖 Documentation

🎯 Essential Guides

Document Purpose
CLAUDE.md 🤖 Claude Code integration guide & development standards
Documentation Index 📚 Complete documentation directory with all guides
GitHub Actions Guide 🚀 CI/CD workflows, automation & best practices
Task Automation ⚡ Complete taskipy command reference

🏗️ Development Standards

Document Purpose
Monorepo Guide 📦 Managing a Python monorepo with shared dependencies
Testing Guide 🧪 Comprehensive testing strategies and patterns
Logging Guide 📊 Structured logging standards and practices
Python Version Management 🐍 Managing Python 3.13+ across the project

🛡️ Operations & Security

Document Purpose
Docker Security 🔒 Container hardening & security practices
Dockerfile Standards 🐋 Best practices for writing Dockerfiles
Database Resilience 💾 Database connection patterns & error handling
Performance Guide ⚡ Performance optimization strategies

📋 Features & References

Document Purpose
Consumer Cancellation 🔄 File completion and consumer lifecycle
Platform Targeting 🎯 Cross-platform compatibility
Emoji Guide 📋 Standardized emoji usage
Recent Improvements 🚀 Latest platform enhancements
Service Guides 📚 Individual README for each service

🚀 Quick Start

✅ Prerequisites

Requirement Minimum Recommended Notes
Python 3.13+ Latest Install via uv
Docker 20.10+ Latest With Docker Compose v2
Storage 100GB 200GB SSD For data + processing
Memory 8GB 16GB+ More RAM = faster processing
Network 10 Mbps 100 Mbps+ Initial download ~50GB

🐳 Using Docker Compose (Recommended)

# 1. Clone and navigate to the repository
git clone https://github.com/SimplicityGuy/discogsography.git
cd discogsography

# 2. Copy environment template (optional - has sensible defaults)
cp .env.example .env

# 3. Start all services (default: Python Extractor)
docker-compose up -d

# 3b. (Optional) Use high-performance Rust Extractor instead
./scripts/switch-extractor.sh rust
# To switch back to Python Extractor: ./scripts/switch-extractor.sh python

# 4. Watch the magic happen!
docker-compose logs -f

# 5. Access the dashboard
open http://localhost:8003

🌐 Service Access

Service URL Default Credentials Purpose
📊 Dashboard http://localhost:8003 None System monitoring
🎵 Discovery http://localhost:8005 None AI music discovery
🐰 RabbitMQ http://localhost:15672 discogsography / discogsography Queue management
🔗 Neo4j http://localhost:7474 neo4j / discogsography Graph exploration
🐘 PostgreSQL localhost:5433 discogsography / discogsography Database access

💻 Local Development

Quick Setup

# 1. Install uv (10-100x faster than pip)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install just (task runner)
brew install just  # macOS
# or: cargo install just
# or: https://just.systems/install.sh

# 3. Install all dependencies
just install

# 4. Set up pre-commit hooks
just init

# 5. Run any service
just dashboard         # Monitoring UI
just discovery         # AI discovery
just pyextractor       # Python data ingestion
just rustextractor-run # Rust data ingestion (requires cargo)
just graphinator       # Neo4j builder
just tableinator       # PostgreSQL builder

Environment Setup

Create a .env file or export variables:

# Core connections
export AMQP_CONNECTION="amqp://guest:guest@localhost:5672/"

# Neo4j settings
export NEO4J_ADDRESS="bolt://localhost:7687"
export NEO4J_USERNAME="neo4j"
export NEO4J_PASSWORD="password"

# PostgreSQL settings
export POSTGRES_ADDRESS="localhost:5433"
export POSTGRES_USERNAME="postgres"
export POSTGRES_PASSWORD="password"
export POSTGRES_DATABASE="discogsography"
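The variables above might be consumed like this. This is only an illustrative reader with a subset of the settings; the project actually centralizes configuration in common/config.py, and the required variables (passwords) deliberately have no fallback:

```python
import os


def load_config() -> dict[str, str]:
    """Illustrative: read service settings from the environment with documented defaults."""
    return {
        "amqp_connection": os.environ.get("AMQP_CONNECTION", "amqp://guest:guest@localhost:5672/"),
        "neo4j_address": os.environ.get("NEO4J_ADDRESS", "bolt://localhost:7687"),
        "neo4j_username": os.environ.get("NEO4J_USERNAME", "neo4j"),
        "neo4j_password": os.environ["NEO4J_PASSWORD"],  # required: KeyError if unset
        "postgres_database": os.environ.get("POSTGRES_DATABASE", "discogsography"),
    }
```

Failing fast on a missing required secret (rather than defaulting it) keeps misconfiguration visible at startup.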

βš™οΈ Configuration

πŸ”§ Environment Variables

All configuration is managed through environment variables. Copy .env.example to .env:

cp .env.example .env

Core Settings

Variable Description Default Used By
AMQP_CONNECTION RabbitMQ URL amqp://guest:guest@localhost:5672/ All services
DISCOGS_ROOT Data storage path /discogs-data Python/Rust Extractors
PERIODIC_CHECK_DAYS Update check interval 15 Python/Rust Extractors
PYTHON_VERSION Python version for builds 3.13 Docker, CI/CD

Database Connections

Variable Description Default Used By
NEO4J_ADDRESS Neo4j bolt URL bolt://localhost:7687 Graphinator, Dashboard, Discovery
NEO4J_USERNAME Neo4j username neo4j Graphinator, Dashboard, Discovery
NEO4J_PASSWORD Neo4j password Required Graphinator, Dashboard, Discovery
POSTGRES_ADDRESS PostgreSQL host:port localhost:5432 Tableinator, Dashboard, Discovery
POSTGRES_USERNAME PostgreSQL username postgres Tableinator, Dashboard, Discovery
POSTGRES_PASSWORD PostgreSQL password Required Tableinator, Dashboard, Discovery
POSTGRES_DATABASE Database name discogsography Tableinator, Dashboard, Discovery

Consumer Management Settings

Variable Description Default Used By
CONSUMER_CANCEL_DELAY Seconds before canceling idle consumers after file completion 300 (5 min) Graphinator, Tableinator
RECONNECT_INTERVAL Seconds between periodic reconnection attempts for completed files 86400 (24 hrs) Graphinator, Tableinator
EMPTY_QUEUE_TIMEOUT Seconds to wait for messages before disconnecting on reconnect 1800 (30 min) Graphinator, Tableinator

πŸ“ Note: The consumer management settings enable automatic reconnection after file processing completes. This ensures that if the extractor processes new Discogs files later, the downstream services will automatically resume consuming messages without manual intervention.

💿 Dataset Scale

Data Type Record Count XML Size Processing Time
📀 Releases ~15 million ~40GB 1-3 hours
🎤 Artists ~2 million ~5GB 15-30 mins
🎵 Masters ~2 million ~3GB 10-20 mins
🏢 Labels ~1.5 million ~2GB 10-15 mins

📊 Total: ~20 million records • 50GB compressed • 100GB processed

💡 Usage Examples

Once your data is loaded, explore the music universe through powerful queries and AI-driven insights.

🔗 Neo4j Graph Queries

Navigate the interconnected world of music with Cypher queries:

Find all albums by an artist

MATCH (a:Artist {name: "Pink Floyd"})-[:BY]-(r:Release)
RETURN r.title, r.year
ORDER BY r.year
LIMIT 10

Discover band members

MATCH (member:Artist)-[:MEMBER_OF]->(band:Artist {name: "The Beatles"})
RETURN member.name, member.real_name

Explore label catalogs

MATCH (r:Release)-[:ON]->(l:Label {name: "Blue Note"})
WHERE r.year >= 1950 AND r.year <= 1970
RETURN r.title, r.artist, r.year
ORDER BY r.year

Find artist collaborations

MATCH (a1:Artist {name: "Miles Davis"})-[:COLLABORATED_WITH]-(a2:Artist)
RETURN DISTINCT a2.name
ORDER BY a2.name

🐘 PostgreSQL Queries

Fast structured queries on denormalized data:

Full-text search releases

SELECT
    data->>'title' as title,
    data->>'artist' as artist,
    data->>'year' as year
FROM releases
WHERE data->>'title' ILIKE '%dark side%'
ORDER BY (data->>'year')::int DESC
LIMIT 10;

Artist discography

SELECT
    data->>'title' as title,
    data->>'year' as year,
    data->'genres' as genres
FROM releases
WHERE data->>'artist' = 'Miles Davis'
AND (data->>'year')::int BETWEEN 1950 AND 1960
ORDER BY (data->>'year')::int;

Genre statistics

SELECT
    genre,
    COUNT(*) as release_count,
    MIN((data->>'year')::int) as first_release,
    MAX((data->>'year')::int) as last_release
FROM releases,
     jsonb_array_elements_text(data->'genres') as genre
GROUP BY genre
ORDER BY release_count DESC
LIMIT 20;

📈 Monitoring & Operations

📊 Dashboard

Access the real-time monitoring dashboard at http://localhost:8003:

  • Service Health: Live status of all microservices
  • Queue Metrics: Message rates, depths, and consumer counts
  • Database Stats: Connection pools and storage usage
  • Activity Log: Recent system events and processing updates
  • WebSocket Updates: Real-time data without page refresh
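A WebSocket push of this kind is just serialized JSON; here is a minimal sketch of one status payload (the field names are illustrative, not the dashboard's actual schema):

```python
import json
from datetime import datetime, timezone


def build_status_message(service: str, healthy: bool, queue_depth: int) -> str:
    """Serialize one dashboard update as a JSON string for a WebSocket push."""
    payload = {
        "service": service,
        "healthy": healthy,
        "queue_depth": queue_depth,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(payload)
```

The browser side simply parses each frame and patches the relevant widget, which is why no page refresh is needed.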

πŸ” Debug Utilities

Monitor and debug your system with built-in tools:

# Check service logs for errors
uv run task check-errors

# Monitor RabbitMQ queues in real-time
uv run task monitor

# Comprehensive system health dashboard
uv run task system-monitor

# View logs for all services
uv run task logs

📊 Metrics

Each service provides detailed telemetry:

  • Processing Rates: Records/second for each data type
  • Queue Health: Depth, consumer count, throughput
  • Error Tracking: Failed messages, retry counts
  • Performance: Processing time, memory usage
  • Stall Detection: Alerts when processing stops

πŸ‘¨β€πŸ’» Development

πŸ› οΈ Modern Python Stack

The project leverages cutting-edge Python tooling:

Tool Purpose Configuration
uv 10-100x faster package management pyproject.toml
ruff Lightning-fast linting & formatting pyproject.toml
mypy Strict static type checking pyproject.toml
bandit Security vulnerability scanning pyproject.toml
pre-commit Git hooks for code quality .pre-commit-config.yaml

🧪 Testing

Comprehensive test coverage with multiple test types:

# Run all tests (excluding E2E)
uv run task test

# Run with coverage report
uv run task test-cov

# Run specific test suites
uv run pytest tests/extractor/      # Extractor tests (Python)
uv run pytest tests/graphinator/    # Graphinator tests
uv run pytest tests/tableinator/    # Tableinator tests
uv run pytest tests/dashboard/      # Dashboard tests

🎭 E2E Testing with Playwright

# One-time browser setup
uv run playwright install chromium
uv run playwright install-deps chromium

# Run E2E tests (automatic server management)
uv run task test-e2e

# Run with specific browser
uv run pytest tests/dashboard/test_dashboard_ui.py -m e2e --browser firefox

🔧 Development Workflow

# Setup development environment
uv sync --all-extras
uv run task init  # Install pre-commit hooks

# Before committing
just lint     # Run linting
just format   # Format code
uv run task test     # Run tests
just security # Security scan

# Or run everything at once
uv run pre-commit run --all-files

πŸ“ Project Structure

discogsography/
β”œβ”€β”€ πŸ“¦ common/              # Shared utilities and configuration
β”‚   β”œβ”€β”€ config.py           # Centralized configuration management
β”‚   └── health_server.py    # Health check endpoint server
β”œβ”€β”€ πŸ“Š dashboard/           # Real-time monitoring dashboard
β”‚   β”œβ”€β”€ dashboard.py        # FastAPI backend with WebSocket
β”‚   └── static/             # Frontend HTML/CSS/JS
β”œβ”€β”€ πŸ“₯ extractor/           # Data extraction services
β”‚   β”œβ”€β”€ pyextractor/        # Python-based Discogs data ingestion
β”‚   β”‚   β”œβ”€β”€ extractor.py    # Main processing logic
β”‚   β”‚   └── discogs.py      # S3 download and validation
β”‚   └── rustextractor/      # Rust-based high-performance extractor
β”‚       β”œβ”€β”€ src/
β”‚       β”‚   └── main.rs     # Rust processing logic
β”‚       └── Cargo.toml      # Rust dependencies
β”œβ”€β”€ πŸ”— graphinator/         # Neo4j graph database service
β”‚   └── graphinator.py      # Graph relationship builder
β”œβ”€β”€ 🐘 tableinator/         # PostgreSQL storage service
β”‚   └── tableinator.py      # Relational data management
β”œβ”€β”€ πŸ”§ utilities/           # Operational tools
β”‚   β”œβ”€β”€ check_errors.py     # Log analysis
β”‚   β”œβ”€β”€ monitor_queues.py   # Real-time queue monitoring
β”‚   └── system_monitor.py   # System health dashboard
β”œβ”€β”€ πŸ§ͺ tests/               # Comprehensive test suite
β”œβ”€β”€ πŸ“ docs/                # Additional documentation
β”œβ”€β”€ πŸ‹ docker-compose.yml   # Container orchestration
└── πŸ“¦ pyproject.toml       # Project configuration

Logging Conventions

All logger calls (logger.info, logger.warning, logger.error) in this project follow a consistent emoji pattern for visual clarity. Each message starts with an emoji followed by exactly one space before the message text.

Emoji Key

Emoji Usage Example
🚀 Startup messages logger.info("🚀 Starting service...")
✅ Success/completion messages logger.info("✅ Operation completed successfully")
❌ Errors logger.error("❌ Failed to connect to database")
⚠️ Warnings logger.warning("⚠️ Connection timeout, retrying...")
🛑 Shutdown/stop messages logger.info("🛑 Shutting down gracefully")
📊 Progress/statistics logger.info("📊 Processed 1000 records")
📥 Downloads logger.info("📥 Starting download of data")
⬇️ Downloading files logger.info("⬇️ Downloading file.xml")
🔄 Processing operations logger.info("🔄 Processing batch of messages")
⏳ Waiting/pending logger.info("⏳ Waiting for messages...")
📋 Metadata operations logger.info("📋 Loaded metadata from cache")
🔍 Checking/searching logger.info("🔍 Checking for updates...")
📄 File operations logger.info("📄 File created successfully")
🆕 New versions logger.info("🆕 Found newer version available")
⏰ Periodic operations logger.info("⏰ Running periodic check")
🔧 Setup/configuration logger.info("🔧 Creating database indexes")
🐰 RabbitMQ connections logger.info("🐰 Connected to RabbitMQ")
🔗 Neo4j connections logger.info("🔗 Connected to Neo4j")
🐘 PostgreSQL operations logger.info("🐘 Connected to PostgreSQL")
💾 Database save operations logger.info("💾 Updated artist ID=123 in Neo4j")
🏥 Health server logger.info("🏥 Health server started on port 8001")
⏩ Skipping operations logger.info("⏩ Skipped artist ID=123 (no changes)")

Example Usage

logger.info("🚀 Starting Discogs data extractor")
logger.error("❌ Failed to connect to Neo4j: connection refused")
logger.warning("⚠️ Slow consumer detected, processing delayed")
logger.info("✅ All files processed successfully")
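The "emoji + exactly one space" rule is mechanically checkable; a tiny hypothetical helper (not part of the codebase) that a lint step or test could use:

```python
def follows_emoji_convention(message: str) -> bool:
    """Check that a log message starts with a non-ASCII emoji, then exactly one space."""
    first_space = message.find(" ")
    if first_space <= 0:
        return False  # no prefix/space separation at all
    prefix, rest = message[:first_space], message[first_space + 1:]
    # Prefix must be non-ASCII (an emoji); text must exist and not start with another space.
    return not prefix.isascii() and bool(rest) and not rest.startswith(" ")
```

Such a check could run over test-captured log records to keep the convention from drifting.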

🗄️ Data Schema

🔗 Neo4j Graph Model

The graph database models complex music industry relationships:

Node Types

Node Description Key Properties
Artist Musicians, bands, producers id, name, real_name, profile
Label Record labels and imprints id, name, profile, parent_label
Master Master recordings id, title, year, main_release
Release Physical/digital releases id, title, year, country, format
Genre Musical genres name
Style Sub-genres and styles name

Relationships

🎤 Artist Relationships:
├── MEMBER_OF ────────→ Artist (band membership)
├── ALIAS_OF ─────────→ Artist (alternative names)
├── COLLABORATED_WITH → Artist (collaborations)
└── PERFORMED_ON ─────→ Release (credits)

📀 Release Relationships:
├── BY ───────────→ Artist (performer credits)
├── ON ───────────→ Label (release label)
├── DERIVED_FROM ─→ Master (master recording)
├── IS ───────────→ Genre (genre classification)
└── IS ───────────→ Style (style classification)

🏢 Label Relationships:
└── SUBLABEL_OF ──→ Label (parent/child labels)

🎵 Classification:
└── Style -[:PART_OF]→ Genre (hierarchy)
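Writes against this model are idempotent if expressed as MERGE; a sketch of how one relationship might be built as a parameterized Cypher statement (the function and its shape are illustrative, not Graphinator's actual code):

```python
def merge_member_of(member_id: int, band_id: int) -> tuple[str, dict]:
    """Build a parameterized Cypher statement linking a member Artist to a band Artist."""
    cypher = (
        "MERGE (m:Artist {id: $member_id}) "
        "MERGE (b:Artist {id: $band_id}) "
        "MERGE (m)-[:MEMBER_OF]->(b)"
    )
    return cypher, {"member_id": member_id, "band_id": band_id}
```

MERGE (rather than CREATE) means replaying the same message never duplicates nodes or relationships, and parameters keep the statement safe and plan-cacheable.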

🐘 PostgreSQL Schema

Optimized for fast queries and full-text search:

-- Artists table with JSONB for flexible schema
CREATE TABLE artists (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_artists_name ON artists ((data->>'name'));
CREATE INDEX idx_artists_gin ON artists USING GIN (data);

-- Labels table
CREATE TABLE labels (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_labels_name ON labels ((data->>'name'));

-- Masters table
CREATE TABLE masters (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_masters_title ON masters ((data->>'title'));
CREATE INDEX idx_masters_year ON masters ((data->>'year'));

-- Releases table with extensive indexing
CREATE TABLE releases (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_releases_title ON releases ((data->>'title'));
CREATE INDEX idx_releases_artist ON releases ((data->>'artist'));
CREATE INDEX idx_releases_year ON releases ((data->>'year'));
CREATE INDEX idx_releases_gin ON releases USING GIN (data);
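Given the hash column above, a Tableinator-style write is naturally an upsert keyed on data_id; a sketch of the SQL and its parameters (illustrative, not the service's actual statements):

```python
import hashlib
import json

# Upsert keyed on data_id; the WHERE clause skips the write when content is unchanged.
UPSERT_RELEASE = """
INSERT INTO releases (data_id, hash, data)
VALUES (%(data_id)s, %(hash)s, %(data)s)
ON CONFLICT (data_id) DO UPDATE
    SET hash = EXCLUDED.hash,
        data = EXCLUDED.data
    WHERE releases.hash <> EXCLUDED.hash
"""


def upsert_params(record: dict) -> dict:
    """Parameters for the upsert: id, SHA256 content hash, and the JSON payload."""
    payload = json.dumps(record, sort_keys=True)
    return {
        "data_id": str(record["id"]),
        "hash": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "data": payload,
    }
```

With psycopg3 this would be executed as `cursor.execute(UPSERT_RELEASE, upsert_params(record))`; the hash guard makes monthly re-ingests mostly no-ops at the database level.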

⚡ Performance & Optimization

📊 Processing Speed

Typical processing rates on modern hardware:

Service Records/Second Bottleneck
📥 Python Extractor 5,000-10,000 XML parsing, I/O
⚡ Rust Extractor 20,000-400,000+ Network I/O
🔗 Graphinator 1,000-2,000 Neo4j transactions
🐘 Tableinator 3,000-5,000 PostgreSQL inserts

💻 Hardware Requirements

Minimum Specifications

  • CPU: 4 cores
  • RAM: 8GB
  • Storage: 200GB HDD
  • Network: 10 Mbps

Recommended Specifications

  • CPU: 8+ cores
  • RAM: 16GB+
  • Storage: 200GB+ SSD (NVMe preferred)
  • Network: 100 Mbps+

🚀 Optimization Guide

Database Tuning

Neo4j Configuration:

# neo4j.conf
dbms.memory.heap.initial_size=4g
dbms.memory.heap.max_size=4g
dbms.memory.pagecache.size=2g

PostgreSQL Configuration:

-- postgresql.conf
shared_buffers = 4GB
work_mem = 256MB
maintenance_work_mem = 1GB
effective_cache_size = 12GB

Message Queue Optimization

# RabbitMQ prefetch for consumers
PREFETCH_COUNT: 100  # Adjust based on processing speed

Storage Performance

  • Use SSD/NVMe for /discogs-data directory
  • Enable compression for PostgreSQL tables
  • Configure Neo4j for SSD optimization
  • Use separate disks for databases if possible

🔧 Troubleshooting

❌ Common Issues & Solutions

Python/Rust Extractor Download Failures

# Check connectivity
curl -I https://discogs-data-dumps.s3.us-west-2.amazonaws.com

# Verify disk space
df -h /discogs-data

# Check permissions
ls -la /discogs-data

Solutions:

  • ✅ Ensure internet connectivity
  • ✅ Verify 100GB+ free space
  • ✅ Check directory permissions

RabbitMQ Connection Issues

# Check RabbitMQ status
docker-compose ps rabbitmq
docker-compose logs rabbitmq

# Test connection
curl -u discogsography:discogsography http://localhost:15672/api/overview

Solutions:

  • ✅ Wait for RabbitMQ startup (30-60s)
  • ✅ Check firewall settings
  • ✅ Verify credentials in .env

Database Connection Errors

Neo4j:

# Check Neo4j status
docker-compose logs neo4j
curl http://localhost:7474

# Test bolt connection
echo "MATCH (n) RETURN count(n);" | cypher-shell -u neo4j -p discogsography

PostgreSQL:

# Check PostgreSQL status
docker-compose logs postgres

# Test connection
PGPASSWORD=discogsography psql -h localhost -U discogsography -d discogsography -c "SELECT 1;"

πŸ› Debugging Guide

  1. 📋 Check Service Health

    curl http://localhost:8000/health  # Python/Rust Extractor
    curl http://localhost:8001/health  # Graphinator
    curl http://localhost:8002/health  # Tableinator
    curl http://localhost:8003/health  # Dashboard
    curl http://localhost:8004/health  # Discovery
  2. 📊 Monitor Real-time Logs

    # All services
    uv run task logs
    
    # Specific service
    docker-compose logs -f extractor-python  # For Python Extractor
    docker-compose logs -f extractor-rust    # For Rust Extractor
  3. 🔍 Analyze Errors

    # Check for errors across all services
    uv run task check-errors
    
    # Monitor queue health
    uv run task monitor
  4. 🗄️ Verify Data Storage

    -- Neo4j: Check node counts
    MATCH (n) RETURN labels(n)[0] as type, count(n) as count;
    -- PostgreSQL: Check table counts
    SELECT 'artists' as table_name, COUNT(*) FROM artists
    UNION ALL
    SELECT 'releases', COUNT(*) FROM releases
    UNION ALL
    SELECT 'labels', COUNT(*) FROM labels
    UNION ALL
    SELECT 'masters', COUNT(*) FROM masters;

🤝 Contributing

We welcome contributions! Please follow these guidelines:

📋 Contribution Process

  1. Fork & Clone

    git clone https://github.com/YOUR_USERNAME/discogsography.git
    cd discogsography
  2. Setup Development Environment

    uv sync --all-extras
    uv run task init  # Install pre-commit hooks
  3. Create Feature Branch

    git checkout -b feature/amazing-feature
  4. Make Changes

    • Write clean, documented code
    • Add comprehensive tests
    • Update relevant documentation
  5. Validate Changes

    just lint      # Fix any linting issues
    just test      # Ensure tests pass
    just security  # Check for vulnerabilities
  6. Commit with Conventional Commits

    git commit -m "feat: add amazing feature"
    # Types: feat, fix, docs, style, refactor, test, chore
  7. Push & Create PR

    git push origin feature/amazing-feature

πŸ“ Development Standards

  • Code Style: Follow ruff and black formatting
  • Type Hints: Required for all functions
  • Tests: Maintain >80% coverage
  • Docs: Update README and docstrings
  • Logging: Use emoji conventions (see above)
  • Security: Pass bandit checks

🔧 Maintenance

Package Upgrades

Keep dependencies up-to-date with the provided upgrade script:

# Safely upgrade all dependencies (minor/patch versions)
./scripts/upgrade-packages.sh

# Preview what would be upgraded
./scripts/upgrade-packages.sh --dry-run

# Include major version upgrades
./scripts/upgrade-packages.sh --major

The script includes:

  • 🔒 Automatic backups before upgrades
  • ✅ Git safety checks (requires clean working directory)
  • 🧪 Automatic testing after upgrades
  • 📦 Comprehensive dependency management across all services

See scripts/README.md for more maintenance scripts.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • 🎡 Discogs for providing the monthly data dumps
  • 🐍 The Python community for excellent libraries and tools
  • 🌟 All contributors who help improve this project
  • πŸš€ uv for blazing-fast package management
  • πŸ”₯ Ruff for lightning-fast linting

💬 Support & Community

Get Help

Documentation

Project Status

This project is actively maintained. We welcome contributions, bug reports, and feature requests!


Made with ❀️ by the Discogsography community
