
Commit 7f531d7

Bring back Dockerfile

File tree

5 files changed, +370 -0 lines changed


.gitignore

+80
@@ -0,0 +1,80 @@
# Python
__pycache__/
*.py[cod]
*.pyo
*.pyd
*.pdb
*.egg
*.egg-info/
dist/
build/
*.egg-info/
*.eggs/
*.whl
*.manifest
*.spec
pip-log.txt
pip-delete-this-directory.txt
*.log
*.pot
*.mo
*.coverage
.coverage.*
.cache
.pytest_cache/
nosetests.xml
coverage.xml
*.cover
.hypothesis/
*.orig
*.rej
.svn/
*.idea/
*.vscode/
*.ropeproject/
*.mypy_cache/
.dmypy.json
.pyre/
.pytype/
cython_debug/

# Docker
*.log
*.tmp
*.pid
*.bak
*.swp
.dockerignore
docker-compose.yml
docker-compose.override.yml
.env
.env.*

# Postgres
*.sql
*.dump
*.backup
pgdata/
pg_log/
pg_xlog/
pg_replslot/
pg_tblspc/
pg_twophase/
pg_stat_tmp/
pg_subtrans/
pg_snapshots/
pg_serial/
pg_notify/
pg_multixact/
pg_logical/
pg_dynshmem/
pg_commit_ts/
pg_clog/
pg_stat/
pg_replslot/
pg_wal/
pg_xact/
pg_hba.conf
pg_ident.conf
postgresql.auto.conf
postgresql.conf

Dockerfile

+7
@@ -0,0 +1,7 @@
FROM postgres:latest

# Install PostgreSQL extensions or tools
RUN apt-get update && apt-get install -y postgresql-contrib

# Copy initial scripts
COPY init.sql /docker-entrypoint-initdb.d/
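The image copies an init.sql into /docker-entrypoint-initdb.d/, but that file is not part of this commit (and the .gitignore above excludes *.sql, so it would not be tracked as-is). A minimal sketch of what it might contain, assuming the customers schema used in the README experiments:

-- init.sql (hypothetical): create the schema that the README experiments expect
CREATE TABLE IF NOT EXISTS customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    email VARCHAR(255) UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);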

Dockerfile.data-generator

+14
@@ -0,0 +1,14 @@
# Use the Python base image
FROM python:3.9-slim

# Set the working directory in the container
WORKDIR /scripts

# Copy the scripts folder into the container
COPY scripts /scripts

# Install dependencies
RUN pip install psycopg2-binary faker

# Set the default command to run your script
CMD ["python", "generate_data.py"]

README.md

+220
@@ -0,0 +1,220 @@
Hands-on Guide to System Design with PostgreSQL

PostgreSQL is a powerful relational database system with extensive features to help developers and engineers build scalable and efficient systems. This hands-on guide demonstrates how to approach system design with PostgreSQL through practical experiments on indexing, caching, parallelism, and query optimization, building toward a deeper understanding of PostgreSQL's role in system design.

Introduction to System Design with PostgreSQL

System design involves creating efficient, scalable, and maintainable systems to meet functional and non-functional requirements. PostgreSQL's features make it an excellent choice for designing systems with:

High Performance: Optimized for both OLTP and OLAP workloads.

Scalability: Parallel queries, partitioning, and support for massive datasets.

Reliability: ACID compliance and strong support for constraints.

Extensibility: Rich extensions (e.g., PostGIS, pgAudit) and advanced features (e.g., JSON support).

This guide will help you:

Understand PostgreSQL's core scaling strategies.

Explore practical experiments to test system behavior.

Apply learnings to real-world system design.

Experiment 1: Query Optimization and Indexing

Objective:

Understand how PostgreSQL optimizes queries with and without indexes.

Steps:

1. Create a Large Dataset

Populate a table with 1 million customer records:

CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    email VARCHAR(255) UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Generate synthetic data using Python (see scripts/generate_data.py) or SQL loops.
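If you would rather stay in SQL than run the Python generator in scripts/generate_data.py, a rough set-based sketch using generate_series (the row count and the fabricated name/email formats are purely illustrative):

-- Insert 1 million synthetic rows in a single statement
INSERT INTO customers (name, email)
SELECT 'Customer ' || g,
       'user' || g || '@example.com'
FROM generate_series(1, 1000000) AS g;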

2. Query Without an Index

Run a query to filter records by email:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM customers WHERE email = '[email protected]';

Expected Behavior: PostgreSQL falls back to a sequential scan, which is costly for large tables.

Observation: Use the query plan to see the query's execution path.

3. Add an Index

CREATE INDEX idx_customers_email ON customers(email);

Re-run the query:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM customers WHERE email = '[email protected]';

Expected Behavior: An index scan significantly reduces execution time.

Observation: Compare query execution times and buffer usage.

Insights:

Understand the trade-offs between sequential and indexed scans.

Learn how indexes improve query performance.

Experiment 2: Parallelism in PostgreSQL

Objective:

Observe PostgreSQL's parallel query execution and its impact on performance.

Steps:

1. Enable Parallelism

Ensure parallel queries are enabled in PostgreSQL:

SHOW max_parallel_workers_per_gather;
SHOW parallel_setup_cost;
SHOW parallel_tuple_cost;

2. Execute a Parallel Query

Run a query on the large dataset that no index can serve:

EXPLAIN (ANALYZE, VERBOSE, BUFFERS) SELECT * FROM customers WHERE name LIKE '%example%';

Expected Behavior: PostgreSQL distributes the scan across multiple workers.

Observation: Examine the Gather and Parallel Seq Scan nodes in the query plan.

3. Adjust Parallelism

Test the impact of different worker settings:

SET max_parallel_workers_per_gather = 4;

Re-run the query and observe the performance difference.
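If the planner declines to parallelize on a smaller test table, the cost parameters inspected in step 1 can be lowered for the current session; these are demonstration settings, not production recommendations:

SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET max_parallel_workers_per_gather = 4;
-- Re-run the EXPLAIN query above and look for Gather and Parallel Seq Scan nodes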

Insights:

Learn how PostgreSQL dynamically scales queries using parallel workers.

Understand the trade-offs between parallelism and resource usage.

Experiment 3: Caching and Memory Optimization

Objective:

Understand PostgreSQL's caching mechanisms and their impact on performance.

Steps:

1. Monitor Cache Usage

Query PostgreSQL's pg_stat_database to view cache hit ratios:

SELECT datname, blks_hit, blks_read,
       blks_hit * 100.0 / NULLIF(blks_hit + blks_read, 0) AS cache_hit_ratio
FROM pg_stat_database;

Expected Behavior: A high cache hit ratio (>90%) for frequently accessed data.

Observation: Identify workloads that benefit from memory optimization.

2. Force Disk Reads

Restart the database server to clear PostgreSQL's shared buffers (the operating system's page cache may still serve some blocks):

docker restart postgres-container

Re-run queries to observe increased disk reads:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM customers WHERE email = '[email protected]';

3. Tune Memory Settings

Modify PostgreSQL's shared_buffers to allocate more memory:

SHOW shared_buffers;
ALTER SYSTEM SET shared_buffers = '512MB';

Restart the server (a configuration reload is not enough, because shared_buffers only takes effect at startup) and observe performance improvements.
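A short sketch of the reload-versus-restart distinction; the work_mem value is illustrative, not a recommendation:

-- Verify the pending value; shared_buffers only takes effect after a server restart
SHOW shared_buffers;

-- Parameters such as work_mem can be applied without a restart
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();
SHOW work_mem;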

Insights:

Learn how caching minimizes disk I/O.

Tune memory settings for optimized resource usage.

Experiment 4: Constraints and Scaling Strategies

Objective:

Test the role of constraints (e.g., unique constraints) and their impact on scaling.

Steps:

1. Drop a Unique Constraint

Remove the unique constraint to simulate a system without strict guarantees:

ALTER TABLE customers DROP CONSTRAINT customers_email_key;

Re-run queries and observe the absence of unique index benefits. (If the idx_customers_email index from Experiment 1 is still in place, drop it first so the comparison is meaningful.)

2. Add the Constraint Back

Recreate the unique constraint:

ALTER TABLE customers ADD CONSTRAINT customers_email_key UNIQUE (email);

Expected Behavior: Queries become faster thanks to the unique index that the constraint creates automatically.

Observation: Constraints enforce data integrity and improve performance.
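One way to confirm that the constraint created an index behind the scenes is to query the catalog (nothing project-specific here beyond the customers table name):

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'customers';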

Building Scalable Systems with PostgreSQL

Use these experiments as building blocks for designing scalable and efficient systems:

Indexing Strategies:

Use compound indexes for multi-column queries (see the sketch after this list).

Analyze query patterns to decide which columns to index.

Partitioning:

Partition large tables for better performance.

Use declarative partitioning for time-series or sharded datasets.

Connection Pooling:

Implement pooling with tools like PgBouncer for high-concurrency systems.

High Availability:

Set up replication for fault tolerance.

Use tools like Patroni for automated failover.

Monitoring and Tuning:

Monitor performance using tools like pg_stat_statements.

Continuously tune parameters based on workload.
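A hedged SQL sketch of a few of these strategies; the orders table, its columns, and the date range are hypothetical, and pg_stat_statements must be listed in shared_preload_libraries before the extension can be used (column names follow PostgreSQL 13+, where total_exec_time replaced total_time):

-- Compound index for a common two-column lookup pattern
CREATE INDEX idx_customers_name_created ON customers (name, created_at);

-- Declarative range partitioning for a hypothetical time-series table
CREATE TABLE orders (
    id BIGSERIAL,
    customer_id INT,
    created_at TIMESTAMP NOT NULL,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Inspect the most expensive queries once pg_stat_statements is enabled
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;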

Conclusion

System design with PostgreSQL is a combination of leveraging its advanced features, understanding its internal mechanisms, and applying best practices. By conducting hands-on experiments and interpreting query plans, you can gain valuable insights into PostgreSQL's scaling strategies and design systems that perform efficiently under real-world workloads.

scripts/generate_data.py

+49
@@ -0,0 +1,49 @@
import psycopg2
from faker import Faker
import os

# Database connection details
DB_HOST = os.getenv("POSTGRES_HOST", "localhost")
DB_PORT = os.getenv("POSTGRES_PORT", "5432")
DB_USER = os.getenv("POSTGRES_USER", "admin")
DB_PASSWORD = os.getenv("POSTGRES_PASSWORD", "admin_password")
DB_NAME = os.getenv("POSTGRES_DB", "ecommerce_db")

# Connect to the database
conn = psycopg2.connect(
    host=DB_HOST,
    port=DB_PORT,
    database=DB_NAME,
    user=DB_USER,
    password=DB_PASSWORD
)
cursor = conn.cursor()

# Use Faker to generate realistic data
faker = Faker()

# Store generated emails to avoid duplicates
generated_emails = set()

# Generate and insert 1 million records
batch_size = 10000
total_records = 1000000

for i in range(0, total_records, batch_size):
    records = []
    for _ in range(batch_size):
        email = faker.email()
        # Ensure unique emails
        while email in generated_emails:
            email = faker.email()
        generated_emails.add(email)
        records.append((faker.name(), email, faker.date_time_this_decade()))

    args_str = ','.join(cursor.mogrify("(%s, %s, %s)", record).decode("utf-8") for record in records)
    cursor.execute(f"INSERT INTO customers (name, email, created_at) VALUES {args_str}")
    conn.commit()
    print(f"{i + batch_size} records inserted...")

print("Data generation complete!")
cursor.close()
conn.close()
