Query Optimization (Indexing)

Introduction

Query optimization: transforming SQL into efficient execution. Goal: minimize resource consumption (CPU, I/O, memory). Automated: optimizer creates execution plans. Manual: developer provides hints. Combined: both approaches effective.

"Good database design sets the foundation, but query optimization determines performance. A single poorly written query can bring down an entire system, while elegant queries enable scale." -- Database performance

Query Optimizer

Purpose

Receives: SQL statement (logical description). Produces: execution plan (physical operations). Goal: find lowest-cost plan. Automatic: transparent to user. Intelligence: statistics, heuristics, cost models.

Optimizer Types

Cost-based: estimates resource cost (CPU, I/O), chooses cheapest. Rule-based: applies heuristics (indexed column, join order). Hybrid: combines both approaches. Modern: mostly cost-based.

Planning Process

1. Parse SQL: validate syntax, identify operations2. Analyze: determine equivalent logical plans3. Estimate: cost each possible plan4. Choose: select minimum-cost plan5. Execute: run selected plan with actual data

Challenges

Incomplete statistics: estimates inaccurate. Complex queries: many possible plans (exponential). Dynamic data: statistics become stale. Trade-offs: compilation time vs optimization quality.

Execution Plans

Plan Structure

Tree: hierarchical operations. Leaf nodes: table scans (base data). Internal nodes: operators (filter, join, sort). Root: final result. Bottom-up: data flows from leaves to root.

Common Operators

Operator	Description	Cost
Table Scan	Read entire table sequentially	High (full scan)
Index Seek	Use index to find rows	Low (selective)
Filter	Apply WHERE conditions	Low (CPU only)
Nested Loop Join	Iterate outer, lookup inner	High (many lookups)
Hash Join	Build hash table, probe	Medium (memory)
Sort	Order results	High (N log N)

Reading Plans

EXPLAIN SELECT e.name, d.dept_nameFROM employees eJOIN departments d ON e.dept_id = d.dept_idWHERE e.salary > 50000;-- Output: operation sequence-- Table Scan: departments (1000 rows)-- Index Seek: employees.salary > 50000 (500 rows)-- Hash Join: on dept_id-- Result: 500 rows

Cost Estimation

Cost Model

I/O cost: disk reads/writes (dominant). CPU cost: processing (secondary). Memory: space for sorts, hash tables. Network: data transfer (distributed systems).

Formula

Total Cost = I/O_cost + CPU_cost + Memory_costI/O_cost = number_of_pages_accessed * page_read_costCPU_cost = number_of_rows_processed * cpu_per_row_cost

Estimation Inaccuracy

Correlated columns: independence assumption violated. Selectivity: estimated vs actual differs. Distribution: skewed data (non-uniform). Cardinality: row count estimates wrong.

Example

Estimate: query returns 100 rows (based on statistics). Actual: returns 10,000 rows. Optimizer chose plan for 100 rows (wrong). Plan: inefficient for 10,000 rows.

Query Rewriting

Equivalent Transformations

-- OriginalSELECT * FROM employees WHERE dept_id IN (SELECT dept_id FROM departments WHERE location = 'NY');-- Rewritten (JOIN)SELECT e.* FROM employees eJOIN departments d ON e.dept_id = d.dept_idWHERE d.location = 'NY';

Common Rewrites

Predicate pushdown: move WHERE earlier in plan (reduce intermediate rows). Join order: reorder joins (different algorithms, costs). Aggregation: early partial aggregation (reduce data). Subquery: convert to join (more optimization opportunities).

Unnesting Subqueries

-- Correlated subquery (slow)SELECT e.name FROM employees eWHERE e.salary > (SELECT AVG(salary) FROM employees WHERE dept_id = e.dept_id);-- Rewritten (fast)SELECT e.name FROM employees eJOIN (SELECT dept_id, AVG(salary) as avg_sal FROM employees GROUP BY dept_id) d ON e.dept_id = d.dept_id AND e.salary > d.avg_sal;

View Merging

-- ViewCREATE VIEW high_earners ASSELECT * FROM employees WHERE salary > 100000;-- Query using viewSELECT * FROM high_earners WHERE dept_id = 10;-- Merged (optimizer inlines view)SELECT * FROM employees WHERE salary > 100000 AND dept_id = 10;

Index Strategies

Index Selection

Selectivity: how much data filtered (0-100%). Low selectivity: index inefficient (better scan). High selectivity: index efficient (few rows). Cardinality: number of distinct values.

Composite Indexes

-- QuerySELECT * FROM orders WHERE customer_id = 5 AND order_date > '2024-01-01';-- Index (customer_id, order_date)-- Efficient: uses both columns for filtering-- Leading edge: customer_id must be first

Covering Indexes

-- Query requires columns: id, name, salarySELECT id, name, salary FROM employees WHERE dept_id = 10;-- Covering indexCREATE INDEX idx_covering ON employees(dept_id) INCLUDE (name, salary);-- Entire result in index (no table lookup needed)

Index Trade-offs

Advantage: faster reads (seek vs scan). Disadvantage: slower writes (insert/update/delete must update index). Space: extra disk storage. Maintenance: overhead for changes.

Join Optimization

Join Algorithms

Nested loop: simple, high cost (many I/O). Hash join: fast (large tables, equality). Sort-merge: efficient (both tables sorted). Index nested loop: use index (selective).

Join Order

-- Query: e JOIN d JOIN p-- Order 1: (e JOIN d) JOIN p-- Order 2: e JOIN (d JOIN p)-- Cost differs significantly (different intermediate sizes)

Selectivity-Based Ordering

Join most selective first: reduces intermediate rows. Example: filter WHERE before joining. Effect: cascade reduction (each join produces fewer rows).

Join Hints

-- Force specific algorithmSELECT /*+ USE_HASH(e d) */ * FROM employees eJOIN departments d ON e.dept_id = d.dept_id;-- Force join orderSELECT /*+ ORDERED */ * FROM employees e, departments dWHERE e.dept_id = d.dept_id;

Subquery Optimization

Correlated Subqueries

-- Problem: executes inner query for each rowSELECT e.name FROM employees eWHERE e.salary > (SELECT AVG(salary) FROM employees WHERE dept_id = e.dept_id);-- Rows: 1000, Inner queries: 1000 (expensive)

Optimization Techniques

Unnesting: convert to join (single pass). Aggregation: compute once, join. EXISTS: better than IN for subqueries. Materialization: compute subquery once, reuse.

IN vs EXISTS

-- IN with subquerySELECT * FROM employees WHERE dept_id IN (SELECT dept_id FROM departments);-- EXISTS (better for large subqueries)SELECT * FROM employees eWHERE EXISTS (SELECT 1 FROM departments d WHERE e.dept_id = d.dept_id);

Statistics and Histograms

Column Statistics

Cardinality: distinct values. Distribution: how values spread. Histograms: frequency per value range. Null count: NULL occurrences. Updated: periodically by DBMS.

Stale Statistics Impact

After bulk insert: statistics outdated. Query plan: based on old distribution. Result: suboptimal execution. Solution: refresh statistics after major changes.

Histogram Types

Equi-width: equal value ranges. Equi-height: equal row count per bucket. Compressed: value-only (no frequencies). Benefits: better selectivity estimation.

Maintenance

-- Refresh statisticsANALYZE TABLE employees COMPUTE STATISTICS;-- Specific columnANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS salary;-- View statisticsSELECT * FROM information_schema.statistics WHERE table_name = 'employees';

Analysis and Tools

EXPLAIN Statement

EXPLAIN SELECT * FROM employees WHERE salary > 50000;Output:- Operation type (Seq Scan, Index Seek, etc.)- Number of rows- Row width- Estimated cost- Filter conditions

EXPLAIN ANALYZE

EXPLAIN ANALYZE SELECT * FROM employees WHERE salary > 50000;Output:- Estimated vs actual rows- Estimated vs actual time- Buffering information- Identifies estimation errors

Query Profiling

SET PROFILING = 1;SELECT * FROM employees WHERE dept_id = 10;SHOW PROFILES;SHOW PROFILE FOR QUERY 1;-- Shows: sending data, query init, opening tables, etc.

Monitoring Tools

DBMS-native: built-in tools (SQL Server Query Store, MySQL Workbench). Third-party: specialized tools (SolarWinds, Redgate). Metrics: execution time, row count, wait events.

Tuning Techniques

Indexing Strategy

Analyze queries: find expensive operations. Add indexes: on frequently filtered/joined columns. Composite: consider together (leading edge matters). Remove: unused indexes (reduce write overhead).

Query Restructuring

Simplify: break complex queries into steps. Denormalize: cache computed values. Aggregate early: reduce rows before joins. Avoid functions: on columns (prevents index use).

Materialized Views

-- Expensive computationCREATE MATERIALIZED VIEW dept_salary_summary ASSELECT dept_id, COUNT(*) as emp_count, AVG(salary) as avg_salaryFROM employees GROUP BY dept_id;-- Query uses precomputed resultsSELECT * FROM dept_salary_summary WHERE dept_id = 10;

Batching and Pagination

-- Avoid: fetch all rowsSELECT * FROM employees;-- Batch: fetch in chunksSELECT * FROM employees LIMIT 100 OFFSET 0;SELECT * FROM employees LIMIT 100 OFFSET 100;

Common Anti-Patterns

Using Functions in WHERE

-- Bad: function prevents index useSELECT * FROM employees WHERE LOWER(name) = 'alice';-- Good: index worksSELECT * FROM employees WHERE name = 'Alice';

NOT IN with NULL

-- Bad: returns nothing (NULL in list)SELECT * FROM employees WHERE dept_id NOT IN (10, 20, NULL);-- Good: use NOT EXISTSSELECT * FROM employees eWHERE NOT EXISTS (SELECT 1 FROM departments d WHERE d.dept_id = e.dept_id);

Implicit Type Conversion

-- Bad: string number prevents indexSELECT * FROM employees WHERE emp_id = '100';-- Good: proper typeSELECT * FROM employees WHERE emp_id = 100;

Unnecessary Joins

-- Bad: joined but not neededSELECT e.name FROM employees eJOIN departments d ON e.dept_id = d.dept_id;-- Good: direct selectSELECT name FROM employees;

Overfetching

-- Bad: get all columnsSELECT * FROM large_table;-- Good: specific columnsSELECT id, name FROM large_table;

References

Ramakrishnan, R., and Gehrke, J. "Database Management Systems." McGraw-Hill, 3rd edition, 2003.
Silberschatz, A., Korth, H. F., and Sudarshan, S. "Database System Concepts." McGraw-Hill, 6th edition, 2010.
Garcia-Molina, H., Ullman, J. D., and Widom, J. "Database Systems: The Complete Book." Pearson, 2nd edition, 2008.
Date, C. J., and Darwen, H. "Database Explorations." Trafford Publishing, 2010.
Celko, J. "SQL Performance Tuning." Sams Publishing, 2nd edition, 2013.