
PostgreSQL Query Optimization Workshop

This section focuses on understanding and optimizing PostgreSQL queries through practical examples and real-world scenarios. For definitions of terms used in this module, refer to our Glossary.

img/03_queries.webp

Prerequisites

Before starting this module, ensure you understand:

Related Concepts

Workshop Files

This workshop consists of the following files:

  1. query_explorer.rb

    • Core utility for analyzing query execution plans
    • Provides colorized and formatted EXPLAIN output
    • Used by all other examples
  2. practice_queries.rb

    • Basic query examples
    • Memory vs disk usage demonstrations
    • Real-world performance comparisons
  3. query_optimization_lab.rb

    • Structured exercises for query optimization
    • Progressive complexity in examples
    • Hands-on optimization practice
  4. advanced_queries.rb

    • Window functions and recursive CTEs
    • LATERAL joins and complex aggregations

Learning Path

1. Understanding Basic Concepts

Start with query_explorer.rb to understand:

  • How EXPLAIN works
  • Reading execution plans
  • Understanding query costs
  • Buffer and timing statistics
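To see the difference in action: plain EXPLAIN only plans the query, while EXPLAIN ANALYZE also runs it. A minimal sketch against the workshop's orders table (the filter value is made up):

-- Plan only: estimates, no execution
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- Plan plus real statistics: this actually runs the query
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;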

2. Basic Query Optimization

Work through practice_queries.rb:

  1. Run the basic example to see a well-optimized query
  2. Observe the memory-based execution
  3. Run the disk usage example to understand:
    • How PostgreSQL handles memory constraints
    • Impact of work_mem on performance
    • Disk vs memory operation differences
    ruby examples/03_queries/practice_queries.rb

3. Structured Optimization Exercises

Complete the exercises in query_optimization_lab.rb:

  1. Basic query analysis
  2. JOIN optimization
  3. Aggregate Functions optimization
  4. Subquery optimization
    ruby examples/03_queries/query_optimization_lab.rb

4. Advanced Query Patterns

Explore complex patterns in advanced_queries.rb:

  1. Window functions
  2. Recursive CTEs
  3. LATERAL joins
  4. Complex aggregations
    ruby examples/03_queries/advanced_queries.rb

Key Concepts by Example

Basic Query Optimization

From practice_queries.rb:

-- Memory-efficient query
SELECT customers.country, COUNT(DISTINCT orders.id)
FROM orders
JOIN customers ON ...
GROUP BY customers.country

-- vs Memory-intensive query
SELECT customers.country,
       STRING_AGG(DISTINCT customers.name, ', '),
       DATE_TRUNC('month', orders.created_at)
...

Join Optimization

From query_optimization_lab.rb:

-- Simple JOIN
SELECT orders.* 
FROM orders 
JOIN customers ON ...

-- vs Complex JOIN
SELECT orders.*, COUNT(line_items.id)
FROM orders 
JOIN customers ON ...
JOIN line_items ON ...
GROUP BY orders.id

Advanced Patterns

From advanced_queries.rb:

-- Window Function
SELECT orders.*, 
       ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at)
FROM orders

-- Recursive CTE
WITH RECURSIVE order_chain AS (
  SELECT ... -- Base case
  UNION ALL
  SELECT ... -- Recursive case
)

Performance Monitoring

Using query_explorer.rb

QueryExplorer.analyze_query(sql)

Key metrics to watch:

  1. Planning vs execution time
  2. Buffer usage patterns
  3. Sort methods and memory usage
  4. Row estimation accuracy

For more details on monitoring, see:

Next Steps

After completing this module, proceed to:

  1. TimescaleDB Extension to learn about time-series optimization
  2. Ruby Performance to understand Ruby-specific optimizations

Basic PostgreSQL Concepts

Relations

In PostgreSQL, a relation is a fundamental database object that represents a set of data. The three main types of relations are:

  1. Tables: Persistent relations that store data in rows and columns
  2. Views: Virtual relations defined by a query
  3. Materialized Views: Cached result sets of a query that can be periodically refreshed

Operators

PostgreSQL provides several types of operators:

  1. Comparison Operators:

    • =, <>, <, >, <=, >=
    • BETWEEN, IN, LIKE, ILIKE
  2. Logical Operators:

    • AND, OR, NOT
  3. Mathematical Operators:

    • +, -, *, /, % (modulo)
    • ^ (exponentiation); &, |, <<, >> (bitwise)
  4. Set Operators:

    • UNION, INTERSECT, EXCEPT
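A small sketch combining several of these operator families in one statement (the filter values are made up):

-- Comparison, logical, and mathematical operators in the filters,
-- a set operator combining the two results
SELECT id FROM orders
WHERE created_at >= now() - INTERVAL '30 days' AND customer_id IN (1, 2, 3)
UNION
SELECT id FROM orders
WHERE id % 2 = 0;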

JOIN Operations

JOINs are fundamental operations in relational databases that combine rows from two or more tables based on a related column between them.

  1. INNER JOIN:

    • Returns only the matching rows between tables
    • Most common type of join
    • Example: SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id
  2. LEFT (OUTER) JOIN:

    • Returns all rows from the left table and matching rows from the right table
    • Non-matching rows filled with NULL
    • Example: SELECT * FROM customers LEFT JOIN orders ON customers.id = orders.customer_id
  3. RIGHT (OUTER) JOIN:

    • Returns all rows from the right table and matching rows from the left table
    • Non-matching rows filled with NULL
    • Example: SELECT * FROM orders RIGHT JOIN customers ON orders.customer_id = customers.id
  4. FULL (OUTER) JOIN:

    • Returns all rows from both tables
    • Non-matching rows filled with NULL
    • Example: SELECT * FROM orders FULL JOIN customers ON orders.customer_id = customers.id
  5. CROSS JOIN:

    • Returns Cartesian product of both tables
    • Every row from first table paired with every row from second table
    • Example: SELECT * FROM sizes CROSS JOIN colors
  6. NATURAL JOIN:

    • Automatically joins tables using columns with the same name
    • Implicitly matches ALL columns with same names
    • Eliminates duplicate columns in the output
    • Example: SELECT * FROM orders NATURAL JOIN order_items
    • ⚠️ Warning: Use with caution as it can lead to unexpected results if:
      • Column names change
      • New columns with same names are added
      • Tables have multiple common column names
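A hedged illustration of that warning: if orders and order_items both happen to carry, say, a created_at column, NATURAL JOIN silently matches on it as well as on the key.

-- Risky: joins on EVERY column name the tables share
SELECT * FROM orders NATURAL JOIN order_items;

-- Safer: state the join condition explicitly
-- (order_items.order_id referencing orders.id is assumed here)
SELECT * FROM orders JOIN order_items ON order_items.order_id = orders.id;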

Data Types

  1. Numeric Types:

    • INTEGER, BIGINT, SMALLINT
    • DECIMAL, NUMERIC, REAL, DOUBLE PRECISION
  2. Character Types:

    • CHAR, VARCHAR, TEXT
  3. Date/Time Types:

    • DATE, TIME, TIMESTAMP
    • INTERVAL
  4. Special Types:

    • BOOLEAN
    • UUID
    • JSON, JSONB
    • ARRAY

Schema Objects

  1. Tables: Basic structure for data storage
  2. Indexes: Improve query performance
  3. Constraints: Enforce data integrity
    • PRIMARY KEY
    • FOREIGN KEY
    • UNIQUE
    • CHECK
    • NOT NULL
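A minimal DDL sketch showing each constraint type; the columns are illustrative, not the workshop's exact schema:

CREATE TABLE customers (
  id      BIGSERIAL PRIMARY KEY,                           -- PRIMARY KEY
  email   TEXT NOT NULL UNIQUE,                            -- NOT NULL + UNIQUE
  country TEXT NOT NULL
);

CREATE TABLE orders (
  id          BIGSERIAL PRIMARY KEY,
  customer_id BIGINT NOT NULL REFERENCES customers (id),   -- FOREIGN KEY
  total       NUMERIC CHECK (total >= 0)                   -- CHECK
);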

Transaction Properties (ACID)

  1. Atomicity: All operations in a transaction succeed or all fail
  2. Consistency: Database remains in a valid state
  3. Isolation: Concurrent transactions don't interfere
  4. Durability: Committed changes are permanent
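Atomicity in one sketch (the accounts table is hypothetical):

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- both updates become visible together; ROLLBACK would undo both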

Visual Query Workflows

Query Processing Pipeline

flowchart LR
    A[SQL Query] --> B[Parser]
    B --> C[Rewriter]
    C --> D[Optimizer]
    D --> E[Executor]
    E --> F[Results]

    subgraph "1. Parsing"
        B --> B1[Syntax Check]
        B1 --> B2[Create Parse Tree]
    end

    subgraph "2. Rewriting"
        C --> C1[View Expansion]
        C1 --> C2[Rule Application]
    end

    subgraph "3. Optimization"
        D --> D1[Generate Plans]
        D1 --> D2[Cost Estimation]
        D2 --> D3[Plan Selection]
    end

JOIN Visualization

Every join operation has its own cost and performance characteristics. Let's see how each join type works.

LEFT JOIN

The left join returns all rows from the left table and the matching rows from the right table. If there is no match, the result will have NULL values for the right table columns.

img/left_join.webp

INNER JOIN

The inner join returns only the matching rows between the two tables. If there is no match, the row is not included in the result.

img/inner_join.webp

FULL OUTER JOIN

The full outer join returns all rows from both tables. If there is no match, the result will have NULL values for the missing columns.

img/full_outer_join.webp

CROSS JOIN

The cross join returns the Cartesian product of the two tables. It returns a row for each combination of the rows from the two tables.

img/cross_join.webp

Query Plan Tree

Now, let's see how the query plan tree works. Let's build an SQL query and look at its plan tree.

SELECT * FROM orders
JOIN customers ON orders.customer_id = customers.id;

This SQL can be represented as a query plan tree.

flowchart TD
    A[Hash Join] --> B[Hash]
    A --> C[Seq Scan orders]
    B --> D[Seq Scan customers]
    
    style A fill:#f9f,stroke:#333
    style B fill:#bbf,stroke:#333
    style C fill:#bfb,stroke:#333
    style D fill:#bfb,stroke:#333

If you run it with EXPLAIN ANALYZE, you will see that plan:

EXPLAIN ANALYZE SELECT * FROM orders
JOIN customers ON orders.customer_id = customers.id;

-- Output (abridged):
Hash Join  (cost=33.90..125.60 rows=2564 width=37)
  Hash Cond: (orders.customer_id = customers.id)
  ->  Seq Scan on orders
  ->  Hash
        ->  Seq Scan on customers

Data Flow Through Joins

For each join type, the data flow is different. Let's see how the data flows through the joins.

Nested Loop Join

The nested loop join is the simplest join type. It reads the outer table and for each row, it reads the inner table and checks if there is a match.

flowchart LR
    subgraph "Nested Loop Join"
        A[Outer Table] --> B[For each row]
        B --> C[Scan Inner]
        C --> D[Match?]
        D -->|Yes| E[Output Row]
        D -->|No| B
    end

    subgraph "Hash Join"
        F[Build Table] --> G[Create Hash Table]
        H[Probe Table] --> I[Hash Lookup]
        G --> I
        I --> J[Match?]
        J -->|Yes| K[Output Row]
    end

Index Usage Patterns

The index is a data structure that allows PostgreSQL to quickly locate the rows that match the query conditions.

Sometimes the index is not used because the query planner thinks that the sequential scan is faster.

flowchart TD
    A[Query with WHERE] --> B{Index Available?}
    B -->|Yes| C{Selective?}
    B -->|No| D[Sequential Scan]
    C -->|Yes| E[Index Scan]
    C -->|No| D
    
    style D fill:#f99,stroke:#333
    style E fill:#9f9,stroke:#333
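You can test the planner's preference yourself with the enable_seqscan setting. Treat this as a session-local experiment, not a production fix:

EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

SET enable_seqscan = off;  -- heavily penalize sequential scans in this session
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

RESET enable_seqscan;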

Query Execution Flow

From SQL to results, there are several steps. Let's see how the query execution flow works.

flowchart TD
    A[SQL Query] --> B[Parser]
    B --> C[Rewriter]
    C --> D[Planner/Optimizer]
    D --> E[Executor]
    E --> F[Results]
    
    subgraph Planning
        D --> D1[Generate Plans]
        D1 --> D2[Cost Estimation]
        D2 --> D3[Plan Selection]
    end

    subgraph Execution
        E --> E1[Execute Plan]
        E1 --> F[Results]
    end

EXPLAIN ANALYZE Workflow

EXPLAIN ANALYZE runs the query and shows both the plan the optimizer chose and the actual execution statistics for each plan node.

flowchart LR
    A[Query] --> B[EXPLAIN ANALYZE]
    B --> C{Node Types}
    C --> D[Seq Scan]
    C --> E[Index Scan]
    C --> F[Hash Join]
    C --> G[Nested Loop]
    
    subgraph Metrics
    M1[Cost] --> M2[Actual Time]
    M2 --> M3[Rows]
    M3 --> M4[Buffers]
    end

Learning Path

img/query_plan.webp

1. Understanding EXPLAIN ANALYZE Basics

Think of the query planner as PostgreSQL's built-in GPS navigation system for your queries. Just like a GPS calculates multiple routes and picks the optimal path based on traffic, distance, and road conditions, the planner:

  • Evaluates multiple possible execution paths for your query
  • Estimates the "cost" of each path based on table statistics, indexes, and data distribution
  • Chooses the path it believes will be fastest

EXPLAIN ANALYZE then actually executes the chosen plan and shows you real timing data alongside those estimates.

Key benefits:

  • Helps identify slow queries and bottlenecks
  • Shows exactly how PostgreSQL executes your query
  • Provides actual vs. estimated statistics to improve query planning
  • Reveals opportunities for adding indexes or restructuring queries

Interactive example:

EXPLAIN ANALYZE SELECT * FROM customers WHERE country = 'USA';

-- Output components (actual-time figures omitted here):
Seq Scan on customers  (cost=0.00..2.62 rows=10 width=8)
  Filter: ((country)::text = 'USA'::text)
  Rows Removed by Filter: 40

Key components to understand:

  • Node Type (e.g., Seq Scan): How PostgreSQL accesses the data
  • Cost: Estimated processing cost (first number is startup, second is total)
  • Rows: Estimated number of rows to be processed
  • Width: Estimated average width of rows in bytes

1.2 Reading Execution Statistics

Reading the output of EXPLAIN ANALYZE can be overwhelming at first. Don't worry if you don't understand the output yet. We'll cover each component in detail in the next sections.

EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM customers WHERE country = 'USA';

-- Output includes:
Seq Scan on customers (cost=0.00..2.62 rows=10 width=8) (actual time=0.004..0.007 rows=10 loops=1)
  Filter: ((country)::text = 'USA'::text)
  Rows Removed by Filter: 40
  Buffers: shared hit=2

Additional metrics:

  • actual time: Real execution time (ms) - startup time..total time
  • rows: Actual number of rows processed
  • loops: Number of times this node was executed
  • Buffers: Memory/disk page access statistics

2. Common Access Methods

Access methods are the different ways PostgreSQL can read data. Each has its own cost and performance characteristics, and the query planner chooses among them based on the query and the table statistics.

flowchart TD
    A[Table Access Methods] --> B[Sequential Scan]
    A --> C[Index Scan]
    A --> D[Index Only Scan]
    
    B --> B1[Full Table Scan]
    C --> C1[B-tree Index]
    C --> C2[Hash Index]
    D --> D1[Covering Index]

2.1 Sequential Scan

A sequential scan reads the entire table from start to finish by scanning each page sequentially. While this seems inefficient, it's actually optimal when:

  • Reading a large portion of the table (>5-10% of rows)
  • The table is small enough to fit in memory
  • No suitable indexes exist

2.2 Index Scan

An index scan walks an index to find matching row locations, then fetches each row from the table (heap). It's efficient when:

  • The query filters on indexed columns
  • The condition is selective, so only a small fraction of rows match
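Assuming the idx_orders_customer_id index used later in this workshop exists, a selective filter should produce an index scan:

CREATE INDEX IF NOT EXISTS idx_orders_customer_id ON orders (customer_id);

EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- Expect something like:
--   Index Scan using idx_orders_customer_id on orders
--     Index Cond: (customer_id = 42)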

2.3 Index Only Scan

An index-only scan answers the query from the index alone, skipping the table entirely (as long as the visibility map shows the pages are all-visible). It's efficient when:

  • The query filters on indexed columns
  • The index includes all columns the query needs
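One way to enable index-only scans is a covering index (INCLUDE requires PostgreSQL 11+); the exact column choice here is an assumption:

CREATE INDEX idx_orders_customer_created
  ON orders (customer_id) INCLUDE (created_at);

-- Every referenced column lives in the index, so the table can be skipped:
EXPLAIN SELECT customer_id, created_at FROM orders WHERE customer_id = 42;
-- Expect: Index Only Scan using idx_orders_customer_created on orders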

2.4 Bitmap Index Scan

A bitmap index scan builds an in-memory bitmap of matching row locations, which a bitmap heap scan then uses to fetch pages in physical order. It's efficient when:

  • The query matches a moderate fraction of rows: too many for a plain index scan, too few for a sequential scan
  • Conditions on one or more indexes can be combined with BitmapAnd / BitmapOr
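ORed conditions on an indexed column are a classic trigger for a bitmap plan; a hedged sketch of the shape you might see:

EXPLAIN SELECT * FROM orders WHERE customer_id = 42 OR customer_id = 43;
-- A possible plan shape:
--   Bitmap Heap Scan on orders
--     Recheck Cond: ((customer_id = 42) OR (customer_id = 43))
--     ->  BitmapOr
--           ->  Bitmap Index Scan on idx_orders_customer_id
--           ->  Bitmap Index Scan on idx_orders_customer_id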

3. Join Operations

flowchart LR
    A[Join Types] --> B[Nested Loop]
    A --> C[Hash Join]
    A --> D[Merge Join]
    
    B --> B1[Small Tables]
    C --> C1[Large Tables]
    D --> D1[Sorted Data]

3.1 Nested Loop Join

-- Good for small tables or when joining with highly selective conditions
SELECT c.name, o.id 
FROM customers c 
JOIN orders o ON c.id = o.customer_id 
WHERE c.country = 'USA';

Example plan:

Nested Loop  (cost=0.28..16.32 rows=10 width=24)
  ->  Seq Scan on customers c  (cost=0.00..2.62 rows=10 width=16)
        Filter: (country = 'USA'::text)
  ->  Index Scan using idx_orders_customer_id on orders o  (cost=0.28..1.37 rows=1 width=16)
        Index Cond: (customer_id = c.id)

3.2 Hash Join

-- Better for larger tables when joining on equality conditions
SELECT c.name, o.id, li.product_name
FROM customers c 
JOIN orders o ON c.id = o.customer_id
JOIN line_items li ON o.id = li.order_id;

Example plan:

Hash Join  (cost=33.90..125.60 rows=2564 width=37)
  Hash Cond: (orders.customer_id = customers.id)
  ->  Hash Join  (cost=30.77..115.17 rows=2564 width=26)
        Hash Cond: (line_items.order_id = orders.id)
        ->  Seq Scan on line_items
        ->  Hash
            ->  Seq Scan on orders
  ->  Hash
      ->  Seq Scan on customers

4. Advanced Operations

flowchart TD
    A[Advanced Features] --> B[Window Functions]
    A --> C[CTEs]
    A --> D[Lateral Joins]
    
    B --> B1[ROW_NUMBER]
    B --> B2[Running Totals]
    C --> C1[Recursive]
    C --> C2[Non-Recursive]
    D --> D1[Top-N per Group]

4.1 Window Functions

From advanced_queries.rb:

SELECT 
  orders.*, 
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at) as order_sequence
FROM orders;

Key components in plan:

  • WindowAgg node
  • Sorting operations for PARTITION BY and ORDER BY
  • Memory usage for window frame

4.2 Recursive CTEs

From advanced_queries.rb:

WITH RECURSIVE order_chain AS (
  -- Base case
  SELECT o.id, o.customer_id, o.created_at, 1 as chain_length
  FROM orders o
  JOIN customers c ON c.id = o.customer_id
  WHERE c.country = 'USA'
  
  UNION ALL
  
  -- Recursive case
  SELECT o.id, o.customer_id, o.created_at, oc.chain_length + 1
  FROM orders o
  JOIN order_chain oc ON o.customer_id = oc.customer_id
  WHERE o.created_at BETWEEN oc.created_at AND oc.created_at + INTERVAL '7 days'
    AND o.id > oc.id
)
SELECT customer_id, MAX(chain_length) as longest_chain
FROM order_chain
GROUP BY customer_id
HAVING MAX(chain_length) > 1;

5. Performance Optimization Tips

flowchart TD
    A[Optimization Areas] --> B[Indexes]
    A --> C[Query Structure]
    A --> D[Data Access]
    
    B --> B1[B-tree]
    B --> B2[Hash]
    B --> B3[GiST]
    
    C --> C1[Join Order]
    C --> C2[Predicate Push-down]
    
    D --> D1[Buffer Cache]
    D --> D2[Sequential vs Random]

5.1 Index Usage

From query_optimization_lab.rb:

def create_indexes
  connection = ActiveRecord::Base.connection

  # Add indexes if they don't exist
  unless connection.index_exists?(:customers, :country)
    connection.add_index :customers, :country
  end

  unless connection.index_exists?(:orders, :customer_id)
    connection.add_index :orders, :customer_id
  end
end

5.2 Join Optimization

From query_optimization_lab.rb:

def exercise_2_join_optimization
  # Query 1: Simple JOIN
  query1 = Order.joins(:customer)
    .where(customers: { country: 'USA' })
  
  # Query 2: Multiple JOINs with optimization
  query2 = Order.joins(:customer, :line_items)
    .where(customers: { country: 'USA' })
    .group('orders.id')
    .select('orders.*, COUNT(line_items.id) as items_count')
end
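For reference, query2 above generates SQL roughly like the following; the exact text ActiveRecord emits may differ slightly:

SELECT orders.*, COUNT(line_items.id) AS items_count
FROM orders
INNER JOIN customers ON customers.id = orders.customer_id
INNER JOIN line_items ON line_items.order_id = orders.id
WHERE customers.country = 'USA'
GROUP BY orders.id;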

Learning Highlights from Examples

Query Optimization Lab Insights

  1. Basic Query Analysis (Exercise 1)

    -- Simple WHERE vs IN condition
    WHERE country = 'USA'              -- Estimated 10 rows
    WHERE country IN ('USA', 'Canada') -- Estimated 20 rows

    Key Observations:

    • Simple equality (=) and the IN clause perform almost identically here (~0.013ms vs ~0.012ms)
    • Both use Sequential Scan due to small table size
    • Row estimation is accurate (10 and 20 rows respectively)
    • Buffer hits are minimal (2 shared hits) indicating good memory usage
  2. JOIN Optimization (Exercise 2)

    -- Simple JOIN vs Multiple JOINs with aggregation
    SELECT orders.* FROM orders JOIN customers ...        -- Hash Join strategy
    SELECT orders.*, COUNT(line_items.id) as items_count -- HashAggregate + Hash Join

    Key Learnings:

    • Simple JOIN: Uses Hash Join strategy (0.122ms)
    • Complex JOIN: Uses HashAggregate with multiple Hash Joins (0.679ms)
    • Memory usage increases with join complexity (9kB → 57kB)
    • Proper index usage on customer_id reduces lookup time
  3. Aggregation Optimization (Exercise 3)

    -- Simple vs Complex aggregation
    GROUP BY customer_id                              -- HashAggregate
    GROUP BY customer_id HAVING SUM(...) > 1000      -- GroupAggregate with Sort

    Notable Points:

    • Simple aggregation uses HashAggregate (0.129ms)
    • Complex aggregation requires sorting (1.849ms)
    • Memory usage varies (24kB vs 214kB)
    • HAVING clause adds significant overhead
  4. Subquery Optimization (Exercise 4)

    -- Subquery vs JOIN approach
    WHERE id IN (SELECT ...)           -- Hash Join: 1.150ms
    JOIN ... GROUP BY ... HAVING ...   -- HashAggregate: 0.248ms

    Important Findings (a SQL sketch of both forms follows this list):

    • JOIN approach is ~4.6x faster than subquery
    • Memory usage is more efficient with JOIN
    • Row estimation is more accurate with JOIN
    • Fewer buffer hits in JOIN approach (10 vs 63)
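The sketch promised under Exercise 4: one plausible shape of the two variants (the exact filter is assumed):

-- Subquery form
SELECT * FROM customers
WHERE id IN (
  SELECT customer_id FROM orders GROUP BY customer_id HAVING COUNT(*) > 5
);

-- Equivalent JOIN form, typically planned more efficiently here
SELECT customers.*
FROM customers
JOIN orders ON orders.customer_id = customers.id
GROUP BY customers.id
HAVING COUNT(orders.id) > 5;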

Advanced Queries Analysis

  1. Window Functions

    -- Row numbers and running totals
    ROW_NUMBER() OVER (PARTITION BY ...)     -- WindowAgg: 0.526ms
    SUM(...) OVER (PARTITION BY ... ORDER BY) -- WindowAgg: 4.018ms

    Key Points:

    • Simple window functions are fast (0.526ms)
    • Adding ORDER BY increases complexity significantly
    • Memory usage increases with window frame size
    • Buffer hits scale with data size
  2. Complex Aggregations

    -- Customer statistics vs Monthly statistics
    GROUP BY customers.id              -- GroupAggregate: 1.596ms
    GROUP BY DATE_TRUNC('month', ...) -- GroupAggregate: 5.329ms

    Performance Notes:

    • Date functions add overhead
    • Memory usage increases with group complexity
    • Sort operations impact performance
    • Buffer hits remain consistent
  3. Recursive CTE

    WITH RECURSIVE order_chain AS (
      -- Base case + Recursive case
    )

    Implementation Insights:

    • Fast execution (0.323ms) for chain analysis
    • Efficient memory usage (40kB)
    • Good filter pushdown
    • Minimal buffer hits (26)
  4. LATERAL JOIN

    -- Top N per group pattern
    CROSS JOIN LATERAL (SELECT ... LIMIT 3)

    Performance Considerations (a full LATERAL sketch follows this list):

    • Highest execution time (80.625ms)
    • Heavy buffer usage (1734 hits)
    • Memory efficient per iteration (25kB)
    • Good for top-N queries despite higher execution time
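The promised LATERAL sketch: a self-contained top-N-per-group query returning each customer's three most recent orders (column names assumed):

SELECT customers.name, recent.id, recent.created_at
FROM customers
CROSS JOIN LATERAL (
  SELECT orders.id, orders.created_at
  FROM orders
  WHERE orders.customer_id = customers.id
  ORDER BY orders.created_at DESC
  LIMIT 3
) AS recent;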

Query Plan Reading Tips

Each query plan has different components. Let's see how to read the query plan.

  1. Cost Components

    The query planner's goal is to find the cheapest way to execute the query. Cost numbers are not milliseconds: they are arbitrary units anchored to seq_page_cost, where fetching one page sequentially costs 1.0. The convention dates from the era when spinning disks were the main storage.

    (cost=0.00..2.62 rows=10 width=40)
    |     |     |     |_ Average row width in bytes
    |     |     |_ Estimated number of rows
    |     |_ Total cost
    |_ Startup cost
    
  2. Actual vs Estimated

    The actual time shows what really happened when the node executed, measured in milliseconds.

    (actual time=0.006..0.009 rows=10 loops=1)
    |        |     |     |     |_ Number of iterations
    |        |     |     |_ Actual rows returned
    |        |     |_ Total time (ms)
    |_ Startup time (ms)
    
  3. Buffer Statistics

    Buffer statistics show how the data was accessed: a hit means the page was already in PostgreSQL's buffer cache, while a read means it had to come from disk. Hits are the good case; what deserves attention is a total buffer count that is large relative to the rows returned, since it means the query touches more data than it needs.

    Buffers: shared hit=2
    |        |     |_ Number of buffer hits
    |        |_ Buffer type (shared, local, temp)
    
  4. Memory Usage

    Memory usage shows how much memory the operation used, in kilobytes. Low values are good, and ideally the operation stays in memory (Batches: 1) rather than spilling to disk.

    Memory Usage: 24kB
    Buckets: 1024
    Batches: 1
    

Common Performance Patterns

The query plan tells you what happened in one execution, but a few recurring patterns help you predict what the planner will do in general.

  1. Sequential vs Index Scan

    • Small tables (< 1000 rows): Sequential Scan is often faster
    • Large tables with selective conditions: Index Scan preferred
    • Full table reads: Sequential Scan is more efficient
  2. Join Strategies

    • Hash Join: Large tables, equality conditions
    • Nested Loop: Small tables or very selective conditions
    • Merge Join: Pre-sorted data or when result must be sorted
  3. Aggregation Methods

    • HashAggregate: Faster but more memory intensive
    • GroupAggregate: Used when sorting is required
    • Plain Aggregate: Single group or very small groups
  4. Memory Management

    • Small operations: Memory-based processing
    • Large sorts: May spill to disk
    • Hash tables: Size depends on row count and width

Performance Red Flags

  1. High Buffer Counts

    • Normal: < 100 for simple queries
    • Warning: > 1000 for simple operations
    • Investigate: Repeated table scans
  2. Execution Time Spikes

    • Simple queries: Should be < 1ms
    • Complex joins: < 10ms
    • Analytical queries: < 100ms
    • Investigate if significantly higher
  3. Row Estimation Errors

    • Good: Estimated rows within 20% of actual
    • Warning: Order of magnitude difference
    • Action: Update table statistics
  4. Memory Usage

    • Normal: < 100kB for simple operations
    • Warning: > 1MB for simple queries
    • Monitor: Spill to disk operations

Running the Examples

  1. Run Basic Examples:
ruby examples/03_queries/practice_queries.rb
  2. Run Optimization Lab:
ruby examples/03_queries/query_optimization_lab.rb
  3. Run Advanced Queries:
ruby examples/03_queries/advanced_queries.rb

Additional Resources

  1. PostgreSQL Official Documentation - Using EXPLAIN
  2. PostgreSQL Official Documentation - Performance Tips
  3. Understanding EXPLAIN ANALYZE Output
  4. Index Types

Understanding Disk Usage in Query Plans

When PostgreSQL can't fit operations in memory (controlled by work_mem), it spills to disk. Here's how to interpret disk usage in EXPLAIN ANALYZE output (a runnable demonstration follows this list):

  1. Sort Operations

    Sort Method: quicksort  Memory: 46kB
    -- vs --
    Sort Method: external merge  Disk: 6360kB
    • quicksort: Operation fits in memory
    • external merge: Had to use disk (slower)
    • Disk: NNNkB: Amount of disk space used
  2. Hash Operations

    Buckets: 1024  Batches: 1  Memory Usage: 24kB
    -- vs --
    Buckets: 2048  Batches: 2  Memory Usage: 113kB
    • Batches: 1: Hash table fits in memory
    • Batches: > 1: Hash table split across disk
    • More buckets = more memory used
  3. Buffer Usage Types

    Buffers: shared hit=7 read=91
    -- vs --
    Buffers: shared hit=1082, temp read=3240 written=3472
    • shared hit: Found in PostgreSQL's buffer cache
    • shared read: Read from disk into cache
    • temp read/written: Temporary files used for disk operations
  4. Memory vs Disk Performance

    Planning Time: 0.164 ms
    Execution Time: 1.894 ms
    -- vs --
    Planning Time: 0.164 ms
    Execution Time: 108.320 ms
    • Memory operations: typically milliseconds
    • Disk operations: can be 10-100x slower
    • Watch for large differences between estimated and actual times
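The demonstration promised above: generate_series gives a self-contained way to watch a sort spill once work_mem is small (64kB is the allowed minimum):

SET work_mem = '64kB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT g FROM generate_series(1, 100000) AS g ORDER BY g DESC;
-- Look for: Sort Method: external merge  Disk: ...kB

RESET work_mem;  -- at the default (typically 4MB) the same sort stays in memory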

Common Disk Usage Patterns

  1. Large Sorts

    -- Memory-intensive sort
    ORDER BY complex_expression

    Watch for:

    • Sort Method: external merge
    • High temp read/written values
    • Multiple sort operations
  2. Complex Aggregations

    -- Memory-intensive aggregation
    GROUP BY multiple_columns
    HAVING complex_conditions

    Signs of disk usage:

    • HashAggregate with multiple batches
    • High temp read/written values
    • Large differences between estimated and actual rows
  3. String Aggregation

    -- Memory-intensive string operation
    STRING_AGG(column, delimiter)

    Indicators:

    • Large Memory Usage values
    • Multiple batches in hash operations
    • High buffer read/write counts
  4. Multiple Joins

    -- Memory-intensive joins
    FROM table1
    JOIN table2 ON ...
    JOIN table3 ON ...

    Look for:

    • Multiple hash operations
    • Increased batches per hash
    • High temporary file usage

Optimizing Disk Usage

  1. Work Memory Settings

    -- Check current work_mem
    SHOW work_mem;
    
    -- Set session work_mem
    SET work_mem = '100MB';
    • Default is often too low for complex queries
    • Increase gradually while monitoring performance
    • Consider per-query adjustment
  2. Maintenance Operations

    -- Update table statistics
    ANALYZE table_name;
    
    -- Clean up bloat
    VACUUM table_name;
    • Keep statistics current
    • Reduce table bloat
    • Regular maintenance reduces disk usage
  3. Query Optimization

    -- Instead of
    SELECT DISTINCT complex_expression FROM some_table
    -- Consider
    SELECT complex_expression FROM some_table GROUP BY complex_expression
    • Minimize sorts where possible
    • Use indexes effectively
    • Consider materialized views for complex aggregations
  4. Monitoring Disk Impact

    -- Check temporary file usage per database
    SELECT datname, temp_files, temp_bytes, blk_read_time, blk_write_time
    FROM pg_stat_database;  -- blk_*_time require track_io_timing = on

    Key metrics:

    • temp_files: Number of temp files created
    • temp_bytes: Total bytes written to temp files
    • blk_read_time: Time spent reading blocks
    • blk_write_time: Time spent writing blocks

Performance Comparison Example

From our practice queries, here's a real-world comparison:

  1. Memory-Based Query

    SELECT customers.country, COUNT(DISTINCT orders.id)
    FROM orders
    JOIN customers ON ...
    GROUP BY customers.country

    Results:

    • Execution time: 1.894 ms
    • Buffer accesses: 98 (7 hits + 91 reads)
    • Sort method: quicksort
    • Memory usage: 46kB
  2. Disk-Based Query

    SELECT customers.country,
           STRING_AGG(DISTINCT customers.name, ', '),
           DATE_TRUNC('month', orders.created_at)
    FROM orders
    JOIN customers ON ...
    GROUP BY ...

    Results:

    • Execution time: 108.320 ms (57x slower)
    • Buffer operations: 7794 (1082 hits + 3240 reads + 3472 writes)
    • Sort method: external merge
    • Disk usage: 6360kB
    • Temp files: read=3240 written=3472

Key Takeaways

  1. Memory vs Disk Tradeoffs

    • Memory operations: Fast but limited
    • Disk operations: Slower but scalable
    • Balance based on data size and complexity
  2. Warning Signs

    • external merge in sort operations
    • Multiple batches in hash operations
    • High temp file usage
    • Large execution time differences
  3. Optimization Strategies

    • Increase work_mem for complex queries
    • Use appropriate indexes
    • Minimize unnecessary sorting/grouping
    • Consider materialized views
  4. Monitoring Tips

    • Watch for unexpected disk operations
    • Monitor temp file usage
    • Compare estimated vs actual rows
    • Track buffer hit ratios