Introduction
Query optimization: transforming SQL into efficient execution. Goal: minimize resource consumption (CPU, I/O, memory). Automated: optimizer creates execution plans. Manual: developer provides hints. Combined: both approaches effective.
"Good database design sets the foundation, but query optimization determines performance. A single poorly written query can bring down an entire system, while elegant queries enable scale." -- Database performance
Query Optimizer
Purpose
Receives: SQL statement (logical description). Produces: execution plan (physical operations). Goal: find lowest-cost plan. Automatic: transparent to user. Intelligence: statistics, heuristics, cost models.
Optimizer Types
Cost-based: estimates resource cost (CPU, I/O), chooses cheapest. Rule-based: applies heuristics (indexed column, join order). Hybrid: combines both approaches. Modern: mostly cost-based.
Planning Process
1. Parse SQL: validate syntax, identify operations2. Analyze: determine equivalent logical plans3. Estimate: cost each possible plan4. Choose: select minimum-cost plan5. Execute: run selected plan with actual dataChallenges
Incomplete statistics: estimates inaccurate. Complex queries: many possible plans (exponential). Dynamic data: statistics become stale. Trade-offs: compilation time vs optimization quality.
Execution Plans
Plan Structure
Tree: hierarchical operations. Leaf nodes: table scans (base data). Internal nodes: operators (filter, join, sort). Root: final result. Bottom-up: data flows from leaves to root.
Common Operators
| Operator | Description | Cost |
|---|---|---|
| Table Scan | Read entire table sequentially | High (full scan) |
| Index Seek | Use index to find rows | Low (selective) |
| Filter | Apply WHERE conditions | Low (CPU only) |
| Nested Loop Join | Iterate outer, lookup inner | High (many lookups) |
| Hash Join | Build hash table, probe | Medium (memory) |
| Sort | Order results | High (N log N) |
Reading Plans
EXPLAIN SELECT e.name, d.dept_nameFROM employees eJOIN departments d ON e.dept_id = d.dept_idWHERE e.salary > 50000;-- Output: operation sequence-- Table Scan: departments (1000 rows)-- Index Seek: employees.salary > 50000 (500 rows)-- Hash Join: on dept_id-- Result: 500 rowsCost Estimation
Cost Model
I/O cost: disk reads/writes (dominant). CPU cost: processing (secondary). Memory: space for sorts, hash tables. Network: data transfer (distributed systems).
Formula
Total Cost = I/O_cost + CPU_cost + Memory_costI/O_cost = number_of_pages_accessed * page_read_costCPU_cost = number_of_rows_processed * cpu_per_row_costEstimation Inaccuracy
Correlated columns: independence assumption violated. Selectivity: estimated vs actual differs. Distribution: skewed data (non-uniform). Cardinality: row count estimates wrong.
Example
Estimate: query returns 100 rows (based on statistics). Actual: returns 10,000 rows. Optimizer chose plan for 100 rows (wrong). Plan: inefficient for 10,000 rows.
Query Rewriting
Equivalent Transformations
-- OriginalSELECT * FROM employees WHERE dept_id IN (SELECT dept_id FROM departments WHERE location = 'NY');-- Rewritten (JOIN)SELECT e.* FROM employees eJOIN departments d ON e.dept_id = d.dept_idWHERE d.location = 'NY';Common Rewrites
Predicate pushdown: move WHERE earlier in plan (reduce intermediate rows). Join order: reorder joins (different algorithms, costs). Aggregation: early partial aggregation (reduce data). Subquery: convert to join (more optimization opportunities).
Unnesting Subqueries
-- Correlated subquery (slow)SELECT e.name FROM employees eWHERE e.salary > (SELECT AVG(salary) FROM employees WHERE dept_id = e.dept_id);-- Rewritten (fast)SELECT e.name FROM employees eJOIN (SELECT dept_id, AVG(salary) as avg_sal FROM employees GROUP BY dept_id) d ON e.dept_id = d.dept_id AND e.salary > d.avg_sal;View Merging
-- ViewCREATE VIEW high_earners ASSELECT * FROM employees WHERE salary > 100000;-- Query using viewSELECT * FROM high_earners WHERE dept_id = 10;-- Merged (optimizer inlines view)SELECT * FROM employees WHERE salary > 100000 AND dept_id = 10;Index Strategies
Index Selection
Selectivity: how much data filtered (0-100%). Low selectivity: index inefficient (better scan). High selectivity: index efficient (few rows). Cardinality: number of distinct values.
Composite Indexes
-- QuerySELECT * FROM orders WHERE customer_id = 5 AND order_date > '2024-01-01';-- Index (customer_id, order_date)-- Efficient: uses both columns for filtering-- Leading edge: customer_id must be firstCovering Indexes
-- Query requires columns: id, name, salarySELECT id, name, salary FROM employees WHERE dept_id = 10;-- Covering indexCREATE INDEX idx_covering ON employees(dept_id) INCLUDE (name, salary);-- Entire result in index (no table lookup needed)Index Trade-offs
Advantage: faster reads (seek vs scan). Disadvantage: slower writes (insert/update/delete must update index). Space: extra disk storage. Maintenance: overhead for changes.
Join Optimization
Join Algorithms
Nested loop: simple, high cost (many I/O). Hash join: fast (large tables, equality). Sort-merge: efficient (both tables sorted). Index nested loop: use index (selective).
Join Order
-- Query: e JOIN d JOIN p-- Order 1: (e JOIN d) JOIN p-- Order 2: e JOIN (d JOIN p)-- Cost differs significantly (different intermediate sizes)Selectivity-Based Ordering
Join most selective first: reduces intermediate rows. Example: filter WHERE before joining. Effect: cascade reduction (each join produces fewer rows).
Join Hints
-- Force specific algorithmSELECT /*+ USE_HASH(e d) */ * FROM employees eJOIN departments d ON e.dept_id = d.dept_id;-- Force join orderSELECT /*+ ORDERED */ * FROM employees e, departments dWHERE e.dept_id = d.dept_id;Subquery Optimization
Correlated Subqueries
-- Problem: executes inner query for each rowSELECT e.name FROM employees eWHERE e.salary > (SELECT AVG(salary) FROM employees WHERE dept_id = e.dept_id);-- Rows: 1000, Inner queries: 1000 (expensive)Optimization Techniques
Unnesting: convert to join (single pass). Aggregation: compute once, join. EXISTS: better than IN for subqueries. Materialization: compute subquery once, reuse.
IN vs EXISTS
-- IN with subquerySELECT * FROM employees WHERE dept_id IN (SELECT dept_id FROM departments);-- EXISTS (better for large subqueries)SELECT * FROM employees eWHERE EXISTS (SELECT 1 FROM departments d WHERE e.dept_id = d.dept_id);Statistics and Histograms
Column Statistics
Cardinality: distinct values. Distribution: how values spread. Histograms: frequency per value range. Null count: NULL occurrences. Updated: periodically by DBMS.
Stale Statistics Impact
After bulk insert: statistics outdated. Query plan: based on old distribution. Result: suboptimal execution. Solution: refresh statistics after major changes.
Histogram Types
Equi-width: equal value ranges. Equi-height: equal row count per bucket. Compressed: value-only (no frequencies). Benefits: better selectivity estimation.
Maintenance
-- Refresh statisticsANALYZE TABLE employees COMPUTE STATISTICS;-- Specific columnANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS salary;-- View statisticsSELECT * FROM information_schema.statistics WHERE table_name = 'employees';Analysis and Tools
EXPLAIN Statement
EXPLAIN SELECT * FROM employees WHERE salary > 50000;Output:- Operation type (Seq Scan, Index Seek, etc.)- Number of rows- Row width- Estimated cost- Filter conditionsEXPLAIN ANALYZE
EXPLAIN ANALYZE SELECT * FROM employees WHERE salary > 50000;Output:- Estimated vs actual rows- Estimated vs actual time- Buffering information- Identifies estimation errorsQuery Profiling
SET PROFILING = 1;SELECT * FROM employees WHERE dept_id = 10;SHOW PROFILES;SHOW PROFILE FOR QUERY 1;-- Shows: sending data, query init, opening tables, etc.Monitoring Tools
DBMS-native: built-in tools (SQL Server Query Store, MySQL Workbench). Third-party: specialized tools (SolarWinds, Redgate). Metrics: execution time, row count, wait events.
Tuning Techniques
Indexing Strategy
Analyze queries: find expensive operations. Add indexes: on frequently filtered/joined columns. Composite: consider together (leading edge matters). Remove: unused indexes (reduce write overhead).
Query Restructuring
Simplify: break complex queries into steps. Denormalize: cache computed values. Aggregate early: reduce rows before joins. Avoid functions: on columns (prevents index use).
Materialized Views
-- Expensive computationCREATE MATERIALIZED VIEW dept_salary_summary ASSELECT dept_id, COUNT(*) as emp_count, AVG(salary) as avg_salaryFROM employees GROUP BY dept_id;-- Query uses precomputed resultsSELECT * FROM dept_salary_summary WHERE dept_id = 10;Batching and Pagination
-- Avoid: fetch all rowsSELECT * FROM employees;-- Batch: fetch in chunksSELECT * FROM employees LIMIT 100 OFFSET 0;SELECT * FROM employees LIMIT 100 OFFSET 100;Common Anti-Patterns
Using Functions in WHERE
-- Bad: function prevents index useSELECT * FROM employees WHERE LOWER(name) = 'alice';-- Good: index worksSELECT * FROM employees WHERE name = 'Alice';NOT IN with NULL
-- Bad: returns nothing (NULL in list)SELECT * FROM employees WHERE dept_id NOT IN (10, 20, NULL);-- Good: use NOT EXISTSSELECT * FROM employees eWHERE NOT EXISTS (SELECT 1 FROM departments d WHERE d.dept_id = e.dept_id);Implicit Type Conversion
-- Bad: string number prevents indexSELECT * FROM employees WHERE emp_id = '100';-- Good: proper typeSELECT * FROM employees WHERE emp_id = 100;Unnecessary Joins
-- Bad: joined but not neededSELECT e.name FROM employees eJOIN departments d ON e.dept_id = d.dept_id;-- Good: direct selectSELECT name FROM employees;Overfetching
-- Bad: get all columnsSELECT * FROM large_table;-- Good: specific columnsSELECT id, name FROM large_table;References
- Ramakrishnan, R., and Gehrke, J. "Database Management Systems." McGraw-Hill, 3rd edition, 2003.
- Silberschatz, A., Korth, H. F., and Sudarshan, S. "Database System Concepts." McGraw-Hill, 6th edition, 2010.
- Garcia-Molina, H., Ullman, J. D., and Widom, J. "Database Systems: The Complete Book." Pearson, 2nd edition, 2008.
- Date, C. J., and Darwen, H. "Database Explorations." Trafford Publishing, 2010.
- Celko, J. "SQL Performance Tuning." Sams Publishing, 2nd edition, 2013.