This section focuses on understanding and optimizing PostgreSQL queries through practical examples and real-world scenarios. For definitions of terms used in this module, refer to our Glossary.
Before starting this module, ensure you understand:
This workshop consists of the following files:
- query_explorer.rb
  - Core utility for analyzing query execution plans
  - Provides colorized and formatted EXPLAIN output
  - Used by all other examples
- practice_queries.rb
  - Basic query examples
  - Memory vs disk usage demonstrations
  - Real-world performance comparisons
- query_optimization_lab.rb
  - Structured exercises for query optimization
  - Progressive complexity in examples
  - Hands-on optimization practice
- advanced_queries.rb
  - Complex query patterns
  - Window functions
  - Recursive CTEs
  - LATERAL joins
Start with query_explorer.rb to understand:
- How EXPLAIN works
- Reading execution plans
- Understanding query costs
- Buffer and timing statistics
Work through practice_queries.rb:
- Run the basic example to see a well-optimized query
- Observe the memory-based execution
- Run the disk usage example to understand:
  - How PostgreSQL handles memory constraints
  - Impact of work_mem on performance
  - Disk vs memory operation differences
ruby examples/03_queries/practice_queries.rb
Complete the exercises in query_optimization_lab.rb:
- Basic query analysis
- JOIN optimization
- Aggregate Functions optimization
- Subquery optimization
ruby examples/03_queries/query_optimization_lab.rb
Explore complex patterns in advanced_queries.rb:
- Window functions
- Recursive CTEs
- LATERAL joins
- Complex aggregations
ruby examples/03_queries/advanced_queries.rb
From practice_queries.rb:
-- Memory-efficient query
SELECT customers.country, COUNT(DISTINCT orders.id)
FROM orders
JOIN customers ON ...
GROUP BY customers.country
-- vs Memory-intensive query
SELECT customers.country,
STRING_AGG(DISTINCT customers.name, ', '),
DATE_TRUNC('month', orders.created_at)
...
From query_optimization_lab.rb:
-- Simple JOIN
SELECT orders.*
FROM orders
JOIN customers ON ...
-- vs Complex JOIN
SELECT orders.*, COUNT(line_items.id)
FROM orders
JOIN customers ON ...
JOIN line_items ON ...
GROUP BY orders.id
From advanced_queries.rb:
-- Window Function
SELECT orders.*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at)
FROM orders
-- Recursive CTE
WITH RECURSIVE order_chain AS (
SELECT ... -- Base case
UNION ALL
SELECT ... -- Recursive case
)
Analyze any query with the helper from query_explorer.rb:
QueryExplorer.analyze_query(sql)
Key metrics to watch:
- Planning vs execution time
- Buffer usage patterns
- Sort methods and memory usage
- Row estimation accuracy
For more details on monitoring, see:
After completing this module, proceed to:
- TimescaleDB Extension to learn about time-series optimization
- Ruby Performance to understand Ruby-specific optimizations
In PostgreSQL, a relation is a fundamental database object that represents a set of data. The main types of relations (illustrated in the sketch after this list) are:
- Tables: Persistent relations that store data in rows and columns
- Views: Virtual relations defined by a query
- Materialized Views: Cached result sets of a query that can be periodically refreshed
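A minimal sketch of these relation types, reusing the workshop's customers and orders tables; the view names here are illustrative, not part of the workshop files:
-- A view is just a stored query; it is re-executed every time it is read
CREATE VIEW usa_customers AS
SELECT * FROM customers WHERE country = 'USA';
-- A materialized view stores the query result and must be refreshed to pick up changes
CREATE MATERIALIZED VIEW orders_per_country AS
SELECT customers.country, COUNT(orders.id) AS order_count
FROM orders
JOIN customers ON customers.id = orders.customer_id
GROUP BY customers.country;
REFRESH MATERIALIZED VIEW orders_per_country;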
PostgreSQL provides several types of operators (a combined example follows this list):
- Comparison Operators: =, <>, <, >, <=, >=, BETWEEN, IN, LIKE, ILIKE
- Logical Operators: AND, OR, NOT
- Mathematical Operators: +, -, *, /, %, ^, |, &, <<, >>
- Set Operators: UNION, INTERSECT, EXCEPT
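A short query combining several of these operators on the workshop's customers table (a sketch; it assumes name, country, and created_at columns, which may differ from the actual seed schema):
-- Comparison, logical, and set operators working together
SELECT id, name
FROM customers
WHERE country IN ('USA', 'Canada')
  AND name ILIKE 'a%'
UNION
SELECT id, name
FROM customers
WHERE created_at >= now() - INTERVAL '30 days';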
JOINs are fundamental operations in relational databases that combine rows from two or more tables based on a related column between them.
- INNER JOIN:
  - Returns only the matching rows between tables
  - Most common type of join
  - Example: SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id
- LEFT (OUTER) JOIN:
  - Returns all rows from the left table and matching rows from the right table
  - Non-matching rows filled with NULL
  - Example: SELECT * FROM customers LEFT JOIN orders ON customers.id = orders.customer_id
- RIGHT (OUTER) JOIN:
  - Returns all rows from the right table and matching rows from the left table
  - Non-matching rows filled with NULL
  - Example: SELECT * FROM orders RIGHT JOIN customers ON orders.customer_id = customers.id
- FULL (OUTER) JOIN:
  - Returns all rows from both tables
  - Non-matching rows filled with NULL
  - Example: SELECT * FROM orders FULL JOIN customers ON orders.customer_id = customers.id
- CROSS JOIN:
  - Returns the Cartesian product of both tables
  - Every row from the first table paired with every row from the second table
  - Example: SELECT * FROM sizes CROSS JOIN colors
- NATURAL JOIN:
  - Automatically joins tables using columns with the same name
  - Implicitly matches ALL columns with the same names
  - Eliminates duplicate columns in the output
  - Example: SELECT * FROM orders NATURAL JOIN order_items
  - ⚠️ Warning: Use with caution, as it can lead to unexpected results if:
    - Column names change
    - New columns with the same names are added
    - Tables have multiple common column names
- Numeric Types: INTEGER, BIGINT, SMALLINT, DECIMAL, NUMERIC, REAL, DOUBLE PRECISION
- Character Types: CHAR, VARCHAR, TEXT
- Date/Time Types: DATE, TIME, TIMESTAMP, INTERVAL
- Special Types: BOOLEAN, UUID, JSON, JSONB, ARRAY
- Tables: Basic structure for data storage
- Indexes: Improve query performance
- Constraints: Enforce data integrity (see the sketch after this list)
  - PRIMARY KEY
  - FOREIGN KEY
  - UNIQUE
  - CHECK
  - NOT NULL
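To tie the data types and constraints together, here is a hypothetical table definition (product_reviews is not part of the workshop schema; it only illustrates the syntax):
CREATE TABLE product_reviews (
  id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,   -- numeric type + PRIMARY KEY
  order_id   BIGINT NOT NULL REFERENCES orders(id),             -- NOT NULL + FOREIGN KEY
  reviewer   TEXT NOT NULL,                                     -- character type
  rating     SMALLINT CHECK (rating BETWEEN 1 AND 5),           -- CHECK constraint
  review_ts  TIMESTAMP DEFAULT now(),                           -- date/time type
  metadata   JSONB,                                             -- special type
  UNIQUE (order_id, reviewer)                                   -- UNIQUE constraint
);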
- Atomicity: All operations in a transaction succeed or all fail
- Consistency: Database remains in a valid state
- Isolation: Concurrent transactions don't interfere
- Durability: Committed changes are permanent
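A quick illustration of atomicity, assuming the workshop's orders table (the ids here are made up):
-- Reassign two orders in one atomic unit of work
BEGIN;
UPDATE orders SET customer_id = 2 WHERE id = 1;
UPDATE orders SET customer_id = 1 WHERE id = 2;
COMMIT;  -- both updates become visible together; ROLLBACK instead would discard both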
flowchart LR
A[SQL Query] --> B[Parser]
B --> C[Rewriter]
C --> D[Optimizer]
D --> E[Executor]
E --> F[Results]
subgraph "1. Parsing"
B --> B1[Syntax Check]
B1 --> B2[Create Parse Tree]
end
subgraph "2. Rewriting"
C --> C1[View Expansion]
C1 --> C2[Rule Application]
end
subgraph "3. Optimization"
D --> D1[Generate Plans]
D1 --> D2[Cost Estimation]
D2 --> D3[Plan Selection]
end
Every join operation has its own cost and performance characteristics. Let's see how each join type works.
The left join returns all rows from the left table and the matching rows from the right table. If there is no match, the result will have NULL values for the right table columns.
The inner join returns only the matching rows between the two tables. If there is no match, the row is not included in the result.
The full outer join returns all rows from both tables. If there is no match, the result will have NULL values for the missing columns.
The cross join returns the Cartesian product of the two tables. It returns a row for each combination of the rows from the two tables.
Now let's see how a query plan tree works. We'll build a SQL query and inspect its plan tree.
SELECT * FROM orders
JOIN customers ON orders.customer_id = customers.id;
This SQL can be represented as a query plan tree.
flowchart TD
A[Hash Join] --> B[Hash]
A --> C[Seq Scan orders]
B --> D[Seq Scan customers]
style A fill:#f9f,stroke:#333
style B fill:#bbf,stroke:#333
style C fill:#bfb,stroke:#333
style D fill:#bfb,stroke:#333
If you run this query with EXPLAIN ANALYZE, you will see the same tree in text form.
EXPLAIN ANALYZE SELECT * FROM orders
JOIN customers ON orders.customer_id = customers.id;
-- Output:
Hash Join (cost=33.90..125.60 rows=2564 width=37)
Hash Cond: (orders.customer_id = customers.id)
  ->  Seq Scan on orders
  ->  Hash
        ->  Seq Scan on customers
For each join type, the data flow is different. Let's see how the data flows through the joins.
The nested loop join is the simplest join type. It reads the outer table and for each row, it reads the inner table and checks if there is a match.
flowchart LR
subgraph "Nested Loop Join"
A[Outer Table] --> B[For each row]
B --> C[Scan Inner]
C --> D[Match?]
D -->|Yes| E[Output Row]
D -->|No| B
end
subgraph "Hash Join"
F[Build Table] --> G[Create Hash Table]
H[Probe Table] --> I[Hash Lookup]
G --> I
I --> J[Match?]
J -->|Yes| K[Output Row]
end
An index is a data structure that allows PostgreSQL to quickly locate the rows that match the query conditions.
Sometimes an index is not used because the query planner estimates that a sequential scan will be faster, for example when the predicate matches a large fraction of the table. The sketch after the diagram shows this choice in practice.
flowchart TD
A[Query with WHERE] --> B{Index Available?}
B -->|Yes| C{Selective?}
B -->|No| D[Sequential Scan]
C -->|Yes| E[Index Scan]
C -->|No| D
style D fill:#f99,stroke:#333
style E fill:#9f9,stroke:#333
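One way to watch the planner make this choice, assuming an index on customers.country exists (it is created later in the optimization lab); the actual decision depends on table size and statistics:
-- Highly selective predicate: the planner is likely to pick an Index Scan
EXPLAIN SELECT * FROM customers WHERE country = 'Iceland';
-- Non-selective predicate: a Sequential Scan is usually cheaper
EXPLAIN SELECT * FROM customers WHERE country <> 'Iceland';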
From SQL to results, there are several steps. Let's see how the query execution flow works.
flowchart TD
A[SQL Query] --> B[Parser]
B --> C[Rewriter]
C --> D[Planner/Optimizer]
D --> E[Executor]
E --> F[Results]
subgraph Planning
D --> D1[Generate Plans]
D1 --> D2[Cost Estimation]
D2 --> D3[Plan Selection]
end
subgraph Execution
E --> E1[Execute Plan]
E1 --> F[Results]
end
EXPLAIN ANALYZE is a command that actually executes the query and shows both the estimated plan and the real execution statistics.
flowchart LR
A[Query] --> B[EXPLAIN ANALYZE]
B --> C{Node Types}
C --> D[Seq Scan]
C --> E[Index Scan]
C --> F[Hash Join]
C --> G[Nested Loop]
subgraph Metrics
M1[Cost] --> M2[Actual Time]
M2 --> M3[Rows]
M3 --> M4[Buffers]
end
Think of EXPLAIN ANALYZE as PostgreSQL's built-in GPS navigation system for your queries. Just like a GPS calculates multiple routes and picks the optimal path based on traffic, distance, and road conditions, EXPLAIN ANALYZE:
- Evaluates multiple possible execution paths for your query
- Estimates the "cost" of each path based on table statistics, indexes, and data distribution
- Chooses the path it believes will be fastest
- Actually executes the query and shows you real timing data
Key benefits:
- Helps identify slow queries and bottlenecks
- Shows exactly how PostgreSQL executes your query
- Provides actual vs. estimated statistics to improve query planning
- Reveals opportunities for adding indexes or restructuring queries
Interactive example:
SELECT * FROM customers WHERE country = 'USA';
-- EXPLAIN output components:
Seq Scan on customers (cost=0.00..2.62 rows=10 width=8)
Filter: ((country)::text = 'USA'::text)
Rows Removed by Filter: 40
Key components to understand:
- Node Type (e.g., Seq Scan): How PostgreSQL accesses the data
- Cost: Estimated processing cost (first number is startup, second is total)
- Rows: Estimated number of rows to be processed
- Width: Estimated average width of rows in bytes
Reading the output of EXPLAIN ANALYZE can be overwhelming at first. Don't worry if you don't understand the output yet. We'll cover each component in detail in the next sections.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM customers WHERE country = 'USA';
-- Output includes:
Seq Scan on customers (cost=0.00..2.62 rows=10 width=8) (actual time=0.004..0.007 rows=10 loops=1)
Filter: ((country)::text = 'USA'::text)
Rows Removed by Filter: 40
Buffers: shared hit=2
Additional metrics:
- actual time: Real execution time (ms) - startup time..total time
- rows: Actual number of rows processed
- loops: Number of times this node was executed
- Buffers: Memory/disk page access statistics
Access methods are the different ways PostgreSQL can read data from a table. Each method has its own cost and performance characteristics, and the query planner chooses the best one based on the query and the table statistics.
flowchart TD
A[Table Access Methods] --> B[Sequential Scan]
A --> C[Index Scan]
A --> D[Index Only Scan]
B --> B1[Full Table Scan]
C --> C1[B-tree Index]
C --> C2[Hash Index]
D --> D1[Covering Index]
A sequential scan reads the entire table from start to finish by scanning each page sequentially. While this seems inefficient, it's actually optimal when:
- Reading a large portion of the table (>5-10% of rows)
- The table is small enough to fit in memory
- No suitable indexes exist
An index scan uses an index to locate matching rows and then fetches them from the table. It's efficient when:
- The query filters on indexed columns
- The condition is selective (matches a small fraction of the rows)
An index-only scan answers the query directly from the index, without visiting the table at all. It's efficient when:
- The query filters on indexed columns
- The index includes every column the query needs (a covering index)
A bitmap index scan builds an in-memory bitmap of matching row locations and then fetches the corresponding pages in physical order. It's efficient when:
- The query matches more rows than a plain index scan handles well, but far fewer than the whole table
- Conditions on several indexes need to be combined (their bitmaps are ANDed/ORed together)
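A sketch of how each access method can show up in EXPLAIN output, assuming the idx_orders_customer_id index used in the examples below (the exact plan depends on table size and statistics):
EXPLAIN SELECT * FROM orders;                                   -- Seq Scan on orders
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;            -- Index Scan (or Bitmap Heap Scan)
EXPLAIN SELECT customer_id FROM orders WHERE customer_id = 42;  -- Index Only Scan (index covers the output)
EXPLAIN SELECT * FROM orders WHERE customer_id IN (1, 2, 3);    -- often Bitmap Index Scan + Bitmap Heap Scan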
flowchart LR
A[Join Types] --> B[Nested Loop]
A --> C[Hash Join]
A --> D[Merge Join]
B --> B1[Small Tables]
C --> C1[Large Tables]
D --> D1[Sorted Data]
-- Good for small tables or when joining with highly selective conditions
SELECT c.name, o.id
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE c.country = 'USA';
Example plan:
Nested Loop  (cost=0.28..16.32 rows=10 width=24)
  ->  Seq Scan on customers c  (cost=0.00..2.62 rows=10 width=16)
        Filter: (country = 'USA'::text)
  ->  Index Scan using idx_orders_customer_id on orders o  (cost=0.28..1.37 rows=1 width=16)
        Index Cond: (customer_id = c.id)
-- Better for larger tables when joining on equality conditions
SELECT c.name, o.id, li.product_name
FROM customers c
JOIN orders o ON c.id = o.customer_id
JOIN line_items li ON o.id = li.order_id;
Example plan:
Hash Join  (cost=33.90..125.60 rows=2564 width=37)
  Hash Cond: (orders.customer_id = customers.id)
  ->  Hash Join  (cost=30.77..115.17 rows=2564 width=26)
        Hash Cond: (line_items.order_id = orders.id)
        ->  Seq Scan on line_items
        ->  Hash
              ->  Seq Scan on orders
  ->  Hash
        ->  Seq Scan on customers
flowchart TD
A[Advanced Features] --> B[Window Functions]
A --> C[CTEs]
A --> D[Lateral Joins]
B --> B1[ROW_NUMBER]
B --> B2[Running Totals]
C --> C1[Recursive]
C --> C2[Non-Recursive]
D --> D1[Top-N per Group]
From advanced_queries.rb:
SELECT
orders.*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at) as order_sequence
FROM orders;
Key components in plan:
- WindowAgg node
- Sorting operations for PARTITION BY and ORDER BY
- Memory usage for window frame
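For comparison, a running total adds ORDER BY inside the window, which makes the WindowAgg node sort and keep more of the frame in memory (a sketch; the total_amount column is assumed here, not taken from the workshop schema):
SELECT
  orders.*,
  SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY created_at) AS running_total
FROM orders;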
From advanced_queries.rb:
WITH RECURSIVE order_chain AS (
-- Base case
SELECT o.id, o.customer_id, o.created_at, 1 as chain_length
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'USA'
UNION ALL
-- Recursive case
SELECT o.id, o.customer_id, o.created_at, oc.chain_length + 1
FROM orders o
JOIN order_chain oc ON o.customer_id = oc.customer_id
WHERE o.created_at BETWEEN oc.created_at AND oc.created_at + INTERVAL '7 days'
AND o.id > oc.id
)
SELECT customer_id, MAX(chain_length) as longest_chain
FROM order_chain
GROUP BY customer_id
HAVING MAX(chain_length) > 1;
flowchart TD
A[Optimization Areas] --> B[Indexes]
A --> C[Query Structure]
A --> D[Data Access]
B --> B1[B-tree]
B --> B2[Hash]
B --> B3[GiST]
C --> C1[Join Order]
C --> C2[Predicate Push-down]
D --> D1[Buffer Cache]
D --> D2[Sequential vs Random]
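Illustrative index definitions for the areas above (the index names are made up; the lab itself only creates plain B-tree indexes):
-- B-tree index (the default access structure)
CREATE INDEX idx_customers_country ON customers (country);
-- Hash index: useful only for equality lookups
CREATE INDEX idx_orders_customer_hash ON orders USING hash (customer_id);
-- Partial index: smaller, covering only the rows a hot query touches
CREATE INDEX idx_usa_customers ON customers (id) WHERE country = 'USA';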
From query_optimization_lab.rb:
def create_indexes
  connection = ActiveRecord::Base.connection

  # Add indexes only if they don't already exist
  unless connection.index_exists?(:customers, :country)
    connection.add_index :customers, :country
  end

  unless connection.index_exists?(:orders, :customer_id)
    connection.add_index :orders, :customer_id
  end
end
From query_optimization_lab.rb:
def exercise_2_join_optimization
# Query 1: Simple JOIN
query1 = Order.joins(:customer)
.where(customers: { country: 'USA' })
# Query 2: Multiple JOINs with optimization
query2 = Order.joins(:customer, :line_items)
.where(customers: { country: 'USA' })
.group('orders.id')
.select('orders.*, COUNT(line_items.id) as items_count')
end
- Basic Query Analysis (Exercise 1)
-- Simple WHERE vs IN condition
WHERE country = 'USA'               -- Estimated 10 rows
WHERE country IN ('USA', 'Canada')  -- Estimated 20 rows
Key Observations:
- Simple equality (=) and the IN clause perform almost identically here (~0.013ms vs ~0.012ms)
- Both use a Sequential Scan due to the small table size
- Row estimation is accurate (10 and 20 rows respectively)
- Buffer hits are minimal (2 shared hits), indicating good memory usage
- JOIN Optimization (Exercise 2)
-- Simple JOIN vs multiple JOINs with aggregation
SELECT orders.* FROM orders JOIN customers ...            -- Hash Join strategy
SELECT orders.*, COUNT(line_items.id) as items_count ...  -- HashAggregate + Hash Join
Key Learnings:
- Simple JOIN: Uses Hash Join strategy (0.122ms)
- Complex JOIN: Uses HashAggregate with multiple Hash Joins (0.679ms)
- Memory usage increases with join complexity (9kB → 57kB)
- Proper index usage on customer_id reduces lookup time
- Aggregation Optimization (Exercise 3)
-- Simple vs complex aggregation
GROUP BY customer_id                         -- HashAggregate
GROUP BY customer_id HAVING SUM(...) > 1000  -- GroupAggregate with Sort
Notable Points:
- Simple aggregation uses HashAggregate (0.129ms)
- Complex aggregation requires sorting (1.849ms)
- Memory usage varies (24kB vs 214kB)
- HAVING clause adds significant overhead
- Subquery Optimization (Exercise 4)
-- Subquery vs JOIN approach
WHERE id IN (SELECT ...)          -- Hash Join: 1.150ms
JOIN ... GROUP BY ... HAVING ...  -- HashAggregate: 0.248ms
Important Findings:
- JOIN approach is ~4.6x faster than subquery
- Memory usage is more efficient with JOIN
- Row estimation is more accurate with JOIN
- Fewer buffer hits in JOIN approach (10 vs 63)
- Window Functions
-- Row numbers and running totals
ROW_NUMBER() OVER (PARTITION BY ...)           -- WindowAgg: 0.526ms
SUM(...) OVER (PARTITION BY ... ORDER BY ...)  -- WindowAgg: 4.018ms
Key Points:
- Simple window functions are fast (0.526ms)
- Adding ORDER BY increases complexity significantly
- Memory usage increases with window frame size
- Buffer hits scale with data size
- Complex Aggregations
-- Customer statistics vs monthly statistics
GROUP BY customers.id              -- GroupAggregate: 1.596ms
GROUP BY DATE_TRUNC('month', ...)  -- GroupAggregate: 5.329ms
Performance Notes:
- Date functions add overhead
- Memory usage increases with group complexity
- Sort operations impact performance
- Buffer hits remain consistent
- Recursive CTE
WITH RECURSIVE order_chain AS (
  -- Base case + Recursive case
)
Implementation Insights:
- Fast execution (0.323ms) for chain analysis
- Efficient memory usage (40kB)
- Good filter pushdown
- Minimal buffer hits (26)
- LATERAL JOIN
-- Top-N per group pattern
CROSS JOIN LATERAL (SELECT ... LIMIT 3)
Performance Considerations:
- Highest execution time (80.625ms)
- Heavy buffer usage (1734 hits)
- Memory efficient per iteration (25kB)
- Good for top-N queries despite higher execution time
Each query plan has different components. Let's see how to read the query plan.
- Cost Components
The goal of the query planner is to find the cheapest way to execute the query. The cost is not a time in milliseconds: it is expressed in arbitrary units where fetching one page sequentially costs 1.0, a convention adopted long ago when spinning disks were the main storage.
(cost=0.00..2.62 rows=10 width=40)
   |     |       |         |_ Average row width in bytes
   |     |       |_ Estimated number of rows
   |     |_ Total cost
   |_ Startup cost
- Actual vs Estimated
The actual time shows how long the node really took to execute, in milliseconds.
(actual time=0.006..0.009 rows=10 loops=1)
        |       |        |       |_ Number of iterations
        |       |        |_ Actual rows returned
        |       |_ Total time (ms)
        |_ Startup time (ms)
- Buffer Statistics
The buffer statistics show how many pages the node touched and whether they were found in PostgreSQL's buffer cache (a hit) or had to be read from disk. A high hit ratio is good; what you want to keep low is the total number of pages a query needs to touch.
Buffers: shared hit=2
            |      |_ Number of buffer hits
            |_ Buffer type (shared, local, temp)
- Memory Usage
Memory usage shows how much memory the operation used, in kilobytes. Ideally it stays within work_mem and does not spill to disk.
Memory Usage: 24kB  Buckets: 1024  Batches: 1
While the query plan is a good way to understand the query execution, it's not the only thing to consider. There are some common performance patterns that can help you to understand the query execution.
- Sequential vs Index Scan
  - Small tables (< 1000 rows): Sequential Scan is often faster
  - Large tables with selective conditions: Index Scan preferred
  - Full table reads: Sequential Scan is more efficient
- Join Strategies
  - Hash Join: Large tables, equality conditions
  - Nested Loop: Small tables or very selective conditions
  - Merge Join: Pre-sorted data or when the result must be sorted
- Aggregation Methods (see the sketch after this list)
  - HashAggregate: Faster but more memory intensive
  - GroupAggregate: Used when sorting is required
  - Plain Aggregate: Single group or very small groups
- Memory Management
  - Small operations: Memory-based processing
  - Large sorts: May spill to disk
  - Hash tables: Size depends on row count and width
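To see HashAggregate and GroupAggregate side by side, you can temporarily steer the planner (a sketch; enable_hashagg is a planner setting you would only toggle for experimentation, never in application code):
EXPLAIN ANALYZE SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;  -- typically HashAggregate
SET enable_hashagg = off;  -- disable hashing so the planner falls back to sorting
EXPLAIN ANALYZE SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;  -- Sort + GroupAggregate
RESET enable_hashagg;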
- High Buffer Hits
  - Normal: < 100 for simple queries
  - Warning: > 1000 for simple operations
  - Investigate: Repeated table scans
- Execution Time Spikes
  - Simple queries: Should be < 1ms
  - Complex joins: < 10ms
  - Analytical queries: < 100ms
  - Investigate if significantly higher
- Row Estimation Errors
  - Good: Estimated rows within 20% of actual
  - Warning: Order of magnitude difference
  - Action: Update table statistics
- Memory Usage
  - Normal: < 100kB for simple operations
  - Warning: > 1MB for simple queries
  - Monitor: Spill to disk operations
- Run Basic Examples:
ruby examples/03_queries/practice_queries.rb
- Run Optimization Lab:
ruby examples/03_queries/query_optimization_lab.rb
- Run Advanced Queries:
ruby examples/03_queries/advanced_queries.rb
- PostgreSQL Official Documentation - Using EXPLAIN
- PostgreSQL Official Documentation - Performance Tips
- Understanding EXPLAIN ANALYZE Output
- Index Types
When PostgreSQL can't fit operations in memory (controlled by work_mem), it spills to disk. Here's how to interpret disk usage in EXPLAIN ANALYZE output:
- Sort Operations
  Sort Method: quicksort  Memory: 46kB
  -- vs --
  Sort Method: external merge  Disk: 6360kB
  - quicksort: Operation fits in memory
  - external merge: Had to use disk (slower)
  - Disk: NNNkB: Amount of disk space used
- Hash Operations
  Buckets: 1024  Batches: 1  Memory Usage: 24kB
  -- vs --
  Buckets: 2048  Batches: 2  Memory Usage: 113kB
  - Batches: 1: Hash table fits in memory
  - Batches > 1: Hash table split across disk
  - More buckets = more memory used
- Buffer Usage Types
  Buffers: shared hit=7 read=91
  -- vs --
  Buffers: shared hit=1082, temp read=3240 written=3472
  - shared hit: Found in PostgreSQL's buffer cache
  - shared read: Read from disk into cache
  - temp read/written: Temporary files used for disk operations
- Memory vs Disk Performance
  Planning Time: 0.164 ms  Execution Time: 1.894 ms
  -- vs --
  Planning Time: 0.164 ms  Execution Time: 108.320 ms
  - Memory operations: typically milliseconds
  - Disk operations: can be 10-100x slower
  - Watch for large differences between estimated and actual times
- Large Sorts
  -- Memory-intensive sort
  ORDER BY complex_expression
  Watch for:
  - Sort Method: external merge
  - High temp read/written values
  - Multiple sort operations
- Complex Aggregations
  -- Memory-intensive aggregation
  GROUP BY multiple_columns
  HAVING complex_conditions
  Signs of disk usage:
  - HashAggregate with multiple batches
  - High temp read/written values
  - Large differences between estimated and actual rows
- String Aggregation
  -- Memory-intensive string operation
  STRING_AGG(column, delimiter)
  Indicators:
  - Large Memory Usage values
  - Multiple batches in hash operations
  - High buffer read/write counts
- Multiple Joins
  -- Memory-intensive joins
  FROM table1
  JOIN table2 ON ...
  JOIN table3 ON ...
  Look for:
  - Multiple hash operations
  - Increased batches per hash
  - High temporary file usage
- Work Memory Settings
  -- Check current work_mem
  SHOW work_mem;
  -- Set session work_mem
  SET work_mem = '100MB';
  - Default is often too low for complex queries
  - Increase gradually while monitoring performance
  - Consider per-query adjustment
- Maintenance Operations
  -- Update table statistics
  ANALYZE table_name;
  -- Clean up bloat
  VACUUM table_name;
  - Keep statistics current
  - Reduce table bloat
  - Regular maintenance reduces disk usage
- Query Optimization
  -- Instead of
  SELECT DISTINCT complex_expression
  -- Consider
  SELECT complex_expression GROUP BY columns
  - Minimize sorts where possible
  - Use indexes effectively
  - Consider materialized views for complex aggregations
- Monitoring Disk Impact
  -- Check temporary file usage
  SELECT * FROM pg_stat_database;
  Key metrics:
  - temp_files: Number of temp files created
  - temp_bytes: Total bytes written to temp files
  - blk_read_time: Time spent reading blocks
  - blk_write_time: Time spent writing blocks
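A more targeted version of that query, pulling out just the metrics listed above (these are standard pg_stat_database columns; blk_read_time and blk_write_time stay at zero unless track_io_timing is enabled):
SELECT datname, temp_files, temp_bytes, blk_read_time, blk_write_time
FROM pg_stat_database
WHERE datname = current_database();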
From our practice queries, here's a real-world comparison:
- Memory-Based Query
  SELECT customers.country, COUNT(DISTINCT orders.id)
  FROM orders
  JOIN customers ON ...
  GROUP BY customers.country
  Results:
  - Execution time: 1.894 ms
  - Buffers touched: 98 (7 hits + 91 reads)
  - Sort method: quicksort
  - Memory usage: 46kB
- Disk-Based Query
  SELECT customers.country,
         STRING_AGG(DISTINCT customers.name, ', '),
         DATE_TRUNC('month', orders.created_at)
  FROM orders
  JOIN customers ON ...
  GROUP BY ...
  Results:
  - Execution time: 108.320 ms (57x slower)
  - Buffer operations: 7794 (1082 hits + 3240 reads + 3472 writes)
  - Sort method: external merge
  - Disk usage: 6360kB
  - Temp files: read=3240 written=3472
- Memory vs Disk Tradeoffs
  - Memory operations: Fast but limited
  - Disk operations: Slower but scalable
  - Balance based on data size and complexity
- Warning Signs
  - external merge in sort operations
  - Multiple batches in hash operations
  - High temp file usage
  - Large execution time differences
- Optimization Strategies
  - Increase work_mem for complex queries
  - Use appropriate indexes
  - Minimize unnecessary sorting/grouping
  - Consider materialized views
- Monitoring Tips
  - Watch for unexpected disk operations
  - Monitor temp file usage
  - Compare estimated vs actual rows
  - Track buffer hit ratios (see the sketch below)
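One way to track the buffer hit ratio mentioned above, using the standard statistics views (a sketch; healthy OLTP workloads usually stay above ~99%):
SELECT datname,
       round(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS cache_hit_pct
FROM pg_stat_database
WHERE datname = current_database();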