
Commit 7f531d7

Bring back Dockerfile

File tree

5 files changed, +370 -0 lines changed


.gitignore

+80
@@ -0,0 +1,80 @@
# Python
__pycache__/
*.py[cod]
*.pyo
*.pyd
*.pdb
*.egg
*.egg-info/
dist/
build/
*.egg-info/
*.eggs/
*.whl
*.manifest
*.spec
pip-log.txt
pip-delete-this-directory.txt
*.log
*.pot
*.mo
*.coverage
.coverage.*
.cache
.pytest_cache/
nosetests.xml
coverage.xml
*.cover
.hypothesis/
*.orig
*.rej
.svn/
*.idea/
*.vscode/
*.ropeproject/
*.mypy_cache/
.dmypy.json
.pyre/
.pytype/
cython_debug/

# Docker
*.log
*.tmp
*.pid
*.bak
*.swp
.dockerignore
docker-compose.yml
docker-compose.override.yml
.env
.env.*

# Postgres
*.sql
*.dump
*.backup
pgdata/
pg_log/
pg_xlog/
pg_replslot/
pg_tblspc/
pg_twophase/
pg_stat_tmp/
pg_subtrans/
pg_snapshots/
pg_serial/
pg_notify/
pg_multixact/
pg_logical/
pg_dynshmem/
pg_commit_ts/
pg_clog/
pg_stat/
pg_replslot/
pg_wal/
pg_xact/
pg_hba.conf
pg_ident.conf
postgresql.auto.conf
postgresql.conf

Dockerfile

+7
@@ -0,0 +1,7 @@
FROM postgres:latest

# Install PostgreSQL extensions or tools
RUN apt-get update && apt-get install -y postgresql-contrib

# Copy initial scripts
COPY init.sql /docker-entrypoint-initdb.d/
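The image copies an init.sql into /docker-entrypoint-initdb.d/, but that file is not part of this commit (and the .gitignore above excludes *.sql, so it would not be tracked as-is). A minimal sketch of what it might contain, assuming the customers schema used in the README experiments:

-- init.sql (hypothetical): create the schema that the README experiments expect
CREATE TABLE IF NOT EXISTS customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    email VARCHAR(255) UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);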

Dockerfile.data-generator

+14
@@ -0,0 +1,14 @@
# Use the Python base image
FROM python:3.9-slim

# Set the working directory in the container
WORKDIR /scripts

# Copy the scripts folder into the container
COPY scripts /scripts

# Install dependencies
RUN pip install psycopg2-binary faker

# Set the default command to run your script
CMD ["python", "generate_data.py"]

README.md

+220
@@ -0,0 +1,220 @@
Hands-on Guide to System Design with PostgreSQL

PostgreSQL is a powerful relational database system with extensive features to help developers and engineers build scalable and efficient systems. This hands-on guide demonstrates how to approach system design with PostgreSQL through practical experiments on indexing, caching, parallelism, and query optimization, building toward a deeper understanding of PostgreSQL's role in system design.

Introduction to System Design with PostgreSQL

System design involves creating efficient, scalable, and maintainable systems to meet functional and non-functional requirements. PostgreSQL's features make it an excellent choice for designing systems with:

High Performance: Optimized for both OLTP and OLAP workloads.

Scalability: Parallel queries, partitioning, and support for massive datasets.

Reliability: ACID compliance and strong support for constraints.

Extensibility: Rich extensions (e.g., PostGIS, pgAudit) and advanced features (e.g., JSON support).

This guide will help you:

Understand PostgreSQL's core scaling strategies.

Explore practical experiments to test system behavior.

Apply learnings to real-world system design.

Experiment 1: Query Optimization and Indexing

Objective:

Understand how PostgreSQL optimizes queries with and without indexes.

Steps:

1. Create a Large Dataset

Populate a table with 1 million customer records:

CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    email VARCHAR(255) UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Generate synthetic data using Python (see scripts/generate_data.py) or SQL loops.
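If you would rather stay in SQL than run the Python generator in scripts/generate_data.py, a rough set-based sketch using generate_series (the row count and the fabricated name/email formats are purely illustrative):

-- Insert 1 million synthetic rows in a single statement
INSERT INTO customers (name, email)
SELECT 'Customer ' || g,
       'user' || g || '@example.com'
FROM generate_series(1, 1000000) AS g;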

2. Query Without an Index

Run a query to filter records by email:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM customers WHERE email = '[email protected]';

Expected Behavior: PostgreSQL falls back to a sequential scan, which is costly for large tables.

Observation: Use the query plan to see the query's execution path.

3. Add an Index

CREATE INDEX idx_customers_email ON customers(email);

Re-run the query:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM customers WHERE email = '[email protected]';

Expected Behavior: An index scan significantly reduces execution time.

Observation: Compare query execution times and buffer usage.

Insights:

Understand the trade-offs between sequential and indexed scans.

Learn how indexes improve query performance.

Experiment 2: Parallelism in PostgreSQL

Objective:

Observe PostgreSQL's parallel query execution and its impact on performance.

Steps:

1. Enable Parallelism

Ensure parallel queries are enabled in PostgreSQL:

SHOW max_parallel_workers_per_gather;
SHOW parallel_setup_cost;
SHOW parallel_tuple_cost;

2. Execute a Parallel Query

Run a query on the large dataset that no index can serve:

EXPLAIN (ANALYZE, VERBOSE, BUFFERS) SELECT * FROM customers WHERE name LIKE '%example%';

Expected Behavior: PostgreSQL distributes the scan across multiple workers.

Observation: Examine the Gather and Parallel Seq Scan nodes in the query plan.

3. Adjust Parallelism

Test the impact of different worker settings:

SET max_parallel_workers_per_gather = 4;

Re-run the query and observe the performance difference.
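If the planner declines to parallelize on a smaller test table, the cost parameters inspected in step 1 can be lowered for the current session; these are demonstration settings, not production recommendations:

SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET max_parallel_workers_per_gather = 4;
-- Re-run the EXPLAIN query above and look for Gather and Parallel Seq Scan nodes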

Insights:

Learn how PostgreSQL dynamically scales queries using parallel workers.

Understand the trade-offs between parallelism and resource usage.

Experiment 3: Caching and Memory Optimization

Objective:

Understand PostgreSQL's caching mechanisms and their impact on performance.

Steps:

1. Monitor Cache Usage

Query PostgreSQL's pg_stat_database to view cache hit ratios:

SELECT datname, blks_hit, blks_read,
       blks_hit * 100.0 / NULLIF(blks_hit + blks_read, 0) AS cache_hit_ratio
FROM pg_stat_database;

Expected Behavior: A high cache hit ratio (>90%) for frequently accessed data.

Observation: Identify workloads that benefit from memory optimization.

2. Force Disk Reads

Restart the database server to clear PostgreSQL's shared buffers (the operating system's page cache may still serve some blocks):

docker restart postgres-container

Re-run queries to observe increased disk reads:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM customers WHERE email = '[email protected]';

3. Tune Memory Settings

Modify PostgreSQL's shared_buffers to allocate more memory:

SHOW shared_buffers;
ALTER SYSTEM SET shared_buffers = '512MB';

Restart the server (a configuration reload is not enough, because shared_buffers only takes effect at startup) and observe performance improvements.
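A short sketch of the reload-versus-restart distinction; the work_mem value is illustrative, not a recommendation:

-- Verify the pending value; shared_buffers only takes effect after a server restart
SHOW shared_buffers;

-- Parameters such as work_mem can be applied without a restart
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();
SHOW work_mem;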

Insights:

Learn how caching minimizes disk I/O.

Tune memory settings for optimized resource usage.

Experiment 4: Constraints and Scaling Strategies

Objective:

Test the role of constraints (e.g., unique constraints) and their impact on scaling.

Steps:

1. Drop a Unique Constraint

Remove the unique constraint to simulate a system without strict guarantees:

ALTER TABLE customers DROP CONSTRAINT customers_email_key;

Re-run queries and observe the absence of unique index benefits. (If the idx_customers_email index from Experiment 1 is still in place, drop it first so the comparison is meaningful.)

2. Add the Constraint Back

Recreate the unique constraint:

ALTER TABLE customers ADD CONSTRAINT customers_email_key UNIQUE (email);

Expected Behavior: Queries become faster thanks to the unique index that the constraint creates automatically.

Observation: Constraints enforce data integrity and improve performance.
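One way to confirm that the constraint created an index behind the scenes is to query the catalog (nothing project-specific here beyond the customers table name):

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'customers';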

Building Scalable Systems with PostgreSQL

Use these experiments as building blocks for designing scalable and efficient systems:

Indexing Strategies:

Use compound indexes for multi-column queries (see the sketch after this list).

Analyze query patterns to decide which columns to index.

Partitioning:

Partition large tables for better performance.

Use declarative partitioning for time-series or sharded datasets.

Connection Pooling:

Implement pooling with tools like PgBouncer for high-concurrency systems.

High Availability:

Set up replication for fault tolerance.

Use tools like Patroni for automated failover.

Monitoring and Tuning:

Monitor performance using tools like pg_stat_statements.

Continuously tune parameters based on workload.
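A hedged SQL sketch of a few of these strategies; the orders table, its columns, and the date range are hypothetical, and pg_stat_statements must be listed in shared_preload_libraries before the extension can be used (column names follow PostgreSQL 13+, where total_exec_time replaced total_time):

-- Compound index for a common two-column lookup pattern
CREATE INDEX idx_customers_name_created ON customers (name, created_at);

-- Declarative range partitioning for a hypothetical time-series table
CREATE TABLE orders (
    id BIGSERIAL,
    customer_id INT,
    created_at TIMESTAMP NOT NULL,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Inspect the most expensive queries once pg_stat_statements is enabled
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;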

Conclusion

System design with PostgreSQL is a combination of leveraging its advanced features, understanding its internal mechanisms, and applying best practices. By conducting hands-on experiments and interpreting query plans, you can gain valuable insights into PostgreSQL's scaling strategies and design systems that perform efficiently under real-world workloads.

scripts/generate_data.py

+49
@@ -0,0 +1,49 @@
import psycopg2
from faker import Faker
import os

# Database connection details
DB_HOST = os.getenv("POSTGRES_HOST", "localhost")
DB_PORT = os.getenv("POSTGRES_PORT", "5432")
DB_USER = os.getenv("POSTGRES_USER", "admin")
DB_PASSWORD = os.getenv("POSTGRES_PASSWORD", "admin_password")
DB_NAME = os.getenv("POSTGRES_DB", "ecommerce_db")

# Connect to the database
conn = psycopg2.connect(
    host=DB_HOST,
    port=DB_PORT,
    database=DB_NAME,
    user=DB_USER,
    password=DB_PASSWORD
)
cursor = conn.cursor()

# Use Faker to generate realistic data
faker = Faker()

# Store generated emails to avoid duplicates
generated_emails = set()

# Generate and insert 1 million records
batch_size = 10000
total_records = 1000000

for i in range(0, total_records, batch_size):
    records = []
    for _ in range(batch_size):
        email = faker.email()
        # Ensure unique emails
        while email in generated_emails:
            email = faker.email()
        generated_emails.add(email)
        records.append((faker.name(), email, faker.date_time_this_decade()))

    args_str = ','.join(cursor.mogrify("(%s, %s, %s)", record).decode("utf-8") for record in records)
    cursor.execute(f"INSERT INTO customers (name, email, created_at) VALUES {args_str}")
    conn.commit()
    print(f"{i + batch_size} records inserted...")

print("Data generation complete!")
cursor.close()
conn.close()
