Introduction
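Subquery: a query nested inside another query. Evaluation: an uncorrelated inner query is conceptually evaluated first and its result feeds the outer query; a correlated one is evaluated per outer row. Flexibility: expresses multi-step logic in one statement. Performance: potential issues (depends on optimizer). Alternative to joins: sometimes better, sometimes worse.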
Types: scalar (single value), IN/EXISTS (membership), correlated (references outer), derived table (FROM subquery).
"Subqueries enable complex queries through nesting. Readability: sometimes better than joins. Performance: must verify (optimizer dependent). Use judiciously: measure actual cost." -- Query design
Scalar Subqueries
Definition
Returns a single value: one row, one column. Used: in SELECT, WHERE, and HAVING clauses, anywhere a single-value expression is allowed. Syntax: (SELECT column FROM table WHERE condition).
Example
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
Execution
Inner query first: compute average salary. Outer query: filter using value. Result: employees above average.
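The two steps can be reproduced in isolation. The following is a minimal sketch using Python's built-in sqlite3 module so that it runs as-is; the three sample rows are invented for illustration and are not from the original text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary REAL);
    INSERT INTO employees VALUES
        ('Ana', 50000), ('Ben', 70000), ('Cy', 90000);
""")

# Step 1: the inner query on its own yields one value (the average).
avg_salary = conn.execute("SELECT AVG(salary) FROM employees").fetchone()[0]

# Step 2: the full query filters every row against that value.
above_avg = conn.execute("""
    SELECT name, salary
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
    ORDER BY name
""").fetchall()

print(avg_salary)
print(above_avg)
```

Ben earns exactly the average here, so the strict comparison leaves him out.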
Multiple Scalars
SELECT
name,
salary,
(SELECT AVG(salary) FROM employees) AS company_avg,
(SELECT MAX(salary) FROM employees) AS max_salary
FROM employees;
Risk
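Multiple rows: runtime error in most DBMS. Zero rows: the subquery yields NULL. Always verify: the subquery returns at most one row, e.g. by using an aggregate or filtering on a unique key.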
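One way to check the single-row assumption is to count the rows a candidate subquery would return before using it as a scalar. The sketch below uses Python's sqlite3 module; `scalar_row_count` is a hypothetical helper and the sample data is invented. Note that SQLite itself does not raise an error for a multi-row scalar subquery (it uses the first row), which is one more reason to check explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept_id INTEGER, salary REAL);
    INSERT INTO employees VALUES
        ('Ana', 10, 50000), ('Ben', 10, 70000), ('Cy', 20, 90000);
""")

def scalar_row_count(conn, subquery_sql, params=()):
    """Hypothetical helper: how many rows would this subquery return?"""
    return conn.execute(
        f"SELECT COUNT(*) FROM ({subquery_sql})", params
    ).fetchone()[0]

candidate = "SELECT salary FROM employees WHERE dept_id = ?"
rows_for_dept_10 = scalar_row_count(conn, candidate, (10,))  # two employees
rows_for_dept_20 = scalar_row_count(conn, candidate, (20,))  # one employee

# An empty scalar subquery evaluates to NULL (None in Python).
empty_value = conn.execute(
    "SELECT (SELECT salary FROM employees WHERE dept_id = 99)"
).fetchone()[0]

print(rows_for_dept_10, rows_for_dept_20, empty_value)
```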
IN Subqueries
Definition
Returns a list: one column, any number of rows. Checks: value in list. Syntax: WHERE column IN (SELECT ...).
Example
SELECT name
FROM employees
WHERE dept_id IN (SELECT dept_id FROM departments WHERE region='East');
Semantics
Equivalent: WHERE dept_id = 10 OR dept_id = 20 OR dept_id = 30 (if subquery returns 10, 20, 30).
NOT IN
WHERE dept_id NOT IN (SELECT dept_id FROM departments WHERE region='West');
Null Handling
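NOT IN with NULLs: if the subquery returns any NULL, the predicate is never TRUE and no rows are returned (comparison with NULL yields UNKNOWN). Safer: use NOT EXISTS, or filter NULLs out of the subquery.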
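The NULL pitfall is easy to reproduce. This is a minimal sketch using Python's sqlite3 module; the schema and rows are invented, and one 'West' department deliberately has a NULL dept_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (dept_id INTEGER, region TEXT);
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    INSERT INTO departments VALUES (10, 'East'), (20, 'West'), (NULL, 'West');
    INSERT INTO employees VALUES ('Ana', 10), ('Ben', 20);
""")

# The subquery returns (20, NULL), so NOT IN is never TRUE for any employee.
not_in_rows = conn.execute("""
    SELECT name FROM employees
    WHERE dept_id NOT IN (SELECT dept_id FROM departments WHERE region = 'West')
""").fetchall()

# NOT EXISTS only asks whether a matching row exists, so the NULL row is harmless.
not_exists_rows = conn.execute("""
    SELECT name FROM employees e
    WHERE NOT EXISTS (
        SELECT 1 FROM departments d
        WHERE d.region = 'West' AND d.dept_id = e.dept_id
    )
""").fetchall()

print(not_in_rows)
print(not_exists_rows)
```

Ana is in an 'East' department, yet only the NOT EXISTS form returns her.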
EXISTS Subqueries
Definition
Tests: existence (true/false). Doesn't return values: the select list is ignored, so SELECT 1 is conventional. Efficient: can stop after finding the first match.
Example
SELECT dept_name
FROM departments d
WHERE EXISTS (SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id);
Semantics
For each department: check if any employee in that department. True: include department. False: exclude.
NOT EXISTS
WHERE NOT EXISTS (SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id);
Efficiency
Stops early: returns true once one match is found. Better than COUNT(*) > 0: COUNT has to count every matching row. Semantic: states the intent directly.
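The EXISTS and COUNT formulations return the same rows; the difference is in how much work the inner query may need to do. A small sketch using Python's sqlite3 module, with invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    INSERT INTO departments VALUES (10, 'Sales'), (20, 'Ops'), (30, 'Legal');
    INSERT INTO employees VALUES ('Ana', 10), ('Ben', 10), ('Cy', 20);
""")

# EXISTS: only asks whether at least one employee row matches.
with_exists = conn.execute("""
    SELECT dept_name FROM departments d
    WHERE EXISTS (SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id)
    ORDER BY dept_name
""").fetchall()

# COUNT: same answer, but the inner query counts every matching row.
with_count = conn.execute("""
    SELECT dept_name FROM departments d
    WHERE (SELECT COUNT(*) FROM employees e WHERE e.dept_id = d.dept_id) > 0
    ORDER BY dept_name
""").fetchall()

# NOT EXISTS: departments with no employees at all.
without_staff = conn.execute("""
    SELECT dept_name FROM departments d
    WHERE NOT EXISTS (SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id)
""").fetchall()

print(with_exists, with_count, without_staff)
```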
Derived Tables (FROM Subqueries)
Definition
Subquery in FROM clause: treated as a table. Alias: required by most DBMS. Enables: multi-step queries, complex joins.
Example
SELECT dept, avg_sal FROM (
SELECT dept_id AS dept, AVG(salary) AS avg_sal
FROM employees
GROUP BY dept_id
) AS dept_stats
WHERE avg_sal > 60000;
Advantage
Readability: breaks complex query. Multiple steps: logical separation. Maintainability: easier understanding.
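The logical separation can be seen by running the inner step alone and then as a derived table. A minimal sketch with Python's sqlite3 module; the four sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept_id INTEGER, salary REAL);
    INSERT INTO employees VALUES
        ('Ana', 10, 50000), ('Ben', 10, 90000),
        ('Cy', 20, 55000), ('Di', 20, 60000);
""")

# Step 1: the inner query alone, one row per department.
dept_stats = conn.execute("""
    SELECT dept_id AS dept, AVG(salary) AS avg_sal
    FROM employees
    GROUP BY dept_id
    ORDER BY dept
""").fetchall()

# Step 2: the same query used as a derived table and filtered by the outer query.
high_avg = conn.execute("""
    SELECT dept, avg_sal FROM (
        SELECT dept_id AS dept, AVG(salary) AS avg_sal
        FROM employees
        GROUP BY dept_id
    ) AS dept_stats
    WHERE avg_sal > 60000
""").fetchall()

print(dept_stats)
print(high_avg)
```

The outer WHERE can filter on avg_sal because, from its point of view, the aggregate is just a column of the derived table.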
Materialization
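Subquery: may be executed once with the result stored in a temporary structure, then joined with the outer query (memory overhead). Alternatively: merged into the outer query without materialization. Optimizer: decides the strategy.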
Performance Impact
Optimizer Behavior
Convert: subquery to JOIN (if beneficial). Inlining: move subquery into main query. Depends: DBMS, query structure. Unpredictable: test empirically.
Correlated Cost
Per-row execution: expensive. Repetition: the subquery may run once per outer row unless the optimizer decorrelates it. Often avoidable: rewrite as a JOIN against a grouped derived table (set operation, usually faster).
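As an illustration of the rewrite, the sketch below finds employees who earn more than their own department's average, first with a correlated subquery and then with a JOIN against a grouped derived table. It uses Python's sqlite3 module with invented data; on a table this small it shows only that the results match, not the performance difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept_id INTEGER, salary REAL);
    INSERT INTO employees VALUES
        ('Ana', 10, 50000), ('Ben', 10, 90000),
        ('Cy', 20, 55000), ('Di', 20, 60000);
""")

# Correlated form: the inner query references e.dept_id from the outer row.
correlated = conn.execute("""
    SELECT e.name
    FROM employees e
    WHERE e.salary > (
        SELECT AVG(e2.salary) FROM employees e2 WHERE e2.dept_id = e.dept_id
    )
    ORDER BY e.name
""").fetchall()

# Join form: averages are computed once per department, then joined.
joined = conn.execute("""
    SELECT e.name
    FROM employees e
    JOIN (
        SELECT dept_id, AVG(salary) AS avg_sal
        FROM employees
        GROUP BY dept_id
    ) AS ds ON e.dept_id = ds.dept_id
    WHERE e.salary > ds.avg_sal
    ORDER BY e.name
""").fetchall()

print(correlated)
print(joined)
```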
IN vs. EXISTS
IN: set operation (efficient). NOT IN with NULLs: problematic. EXISTS: efficient (can stop early). Choose: based on logic and measured performance.
EXPLAIN Analysis
EXPLAIN SELECT ... FROM ... WHERE EXISTS ...;
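Check: subquery executed once or per row?
Dependent/correlated subquery node (e.g. DEPENDENT SUBQUERY in MySQL, SubPlan in PostgreSQL): indicates per-row evaluation
Nested loop: not a problem by itself; costly only when the inner side is scanned without an index
Semi-join: indicates the subquery was rewritten as a join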
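Plan output can also be inspected programmatically. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module on an invented schema. The wording of the plan differs between SQLite versions and between DBMS, so the code prints the plan rather than assuming any particular text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
""")

query = """
    SELECT dept_name
    FROM departments d
    WHERE EXISTS (SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id)
"""

# SQLite's variant of EXPLAIN; each row ends with a human-readable detail string.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
plan_details = [row[-1] for row in plan]

for detail in plan_details:
    print(detail)
```

When reading the output, look for wording that marks the subquery as correlated or as a scan of the inner table.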
Subqueries vs. Joins
Equivalence
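Many subqueries: expressible as JOINs. JOINs: often more efficient. Optimizer: may convert automatically. Semantics: same result only if the join key is unique on the joined side and NULL handling matches; otherwise a JOIN can duplicate rows.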
Example Comparison
Subquery:
SELECT name FROM employees
WHERE dept_id IN (SELECT dept_id FROM departments WHERE region='East');
Join:
SELECT e.name  -- assumes departments.dept_id is unique; DISTINCT would merge same-named employees
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
WHERE d.region = 'East';
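The two forms agree only when the join key is unique on the subquery side. The sketch below shows them diverging when it is not. It uses Python's sqlite3 module; `dept_offices` is a hypothetical table (one row per office, so dept_id repeats) invented for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    -- Hypothetical table: one row per office, so dept_id repeats.
    CREATE TABLE dept_offices (dept_id INTEGER, region TEXT);
    INSERT INTO employees VALUES ('Ana', 10), ('Ben', 20);
    INSERT INTO dept_offices VALUES (10, 'East'), (10, 'East'), (20, 'West');
""")

# IN: a membership test, so each employee appears at most once.
in_rows = conn.execute("""
    SELECT name FROM employees
    WHERE dept_id IN (SELECT dept_id FROM dept_offices WHERE region = 'East')
""").fetchall()

# JOIN: one output row per matching pair, so Ana appears once per East office.
join_rows = conn.execute("""
    SELECT e.name
    FROM employees e
    JOIN dept_offices o ON e.dept_id = o.dept_id
    WHERE o.region = 'East'
""").fetchall()

print(in_rows)
print(join_rows)
```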
Performance Implications
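JOIN: often as fast or faster (join algorithms are heavily optimized). Subquery: many optimizers rewrite IN/EXISTS as a semi-join and produce the same plan; otherwise it may be slower. Test: benchmark both on realistic data.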
Readability
Subquery: sometimes clearer (nested logic). JOIN: sometimes clearer (explicit relationships). Choose: based on understandability.
Recommendation
Prefer JOINs: when both forms are equally clear, since performance is generally at least as good. Use subqueries: when necessary or for readability. Measure: verify actual performance.
Query Optimization
Rewriting Strategies
Subquery to JOIN: if equivalent. Derived table: simplify (remove unnecessary columns). Remove: unused subqueries.
Index Usage
Subquery filter: can use an index on the filtered column. Correlated: an index on the correlated (join) column of the inner table speeds each per-row lookup. Analyze: confirm in the execution plan that the index is used.
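The effect of an index on the kind of lookup a correlated subquery performs for each outer row can be checked in the plan. A minimal sketch with Python's sqlite3 module; the table, the data, the index name `idx_emp_dept`, and the helper `plan_text` are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    INSERT INTO employees VALUES ('Ana', 10), ('Ben', 20), ('Cy', 10);
""")

def plan_text(conn, sql):
    """Join the detail column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# The equality lookup a correlated subquery would repeat per outer row.
lookup = "SELECT name FROM employees WHERE dept_id = 10"

plan_before = plan_text(conn, lookup)  # no index exists yet
conn.execute("CREATE INDEX idx_emp_dept ON employees (dept_id)")
plan_after = plan_text(conn, lookup)   # the planner can now pick the index

print(plan_before)
print(plan_after)
```

Comparing the two printed plans shows whether the planner switched from scanning the table to searching the index.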
Simplification
Complex nested subquery: break into CTE (more readable)
Multiple subqueries: consolidate where possible
Redundant: eliminate duplicate logic
Common Table Expressions (CTEs)
Definition
Named subquery: reusable within the statement that defines it. WITH clause: defined before the main query. Improves readability: named intermediate results.
Example
WITH dept_stats AS (
SELECT dept_id, AVG(salary) AS avg_sal
FROM employees
GROUP BY dept_id
)
SELECT e.name, e.salary
FROM employees e
JOIN dept_stats ds ON e.dept_id = ds.dept_id
WHERE e.salary > ds.avg_sal;
Advantages
Readability: named intermediate steps. Reusability: can be referenced multiple times in the same statement. Clarity: complex queries simplified.
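Reuse within one statement can be sketched as follows: the CTE is referenced once in FROM and once in a scalar subquery to find the department with the highest average salary. The example uses Python's sqlite3 module with invented data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept_id INTEGER, salary REAL);
    INSERT INTO employees VALUES
        ('Ana', 10, 50000), ('Ben', 10, 90000),
        ('Cy', 20, 55000), ('Di', 20, 60000);
""")

# dept_stats is defined once and referenced twice in the main query.
top_dept = conn.execute("""
    WITH dept_stats AS (
        SELECT dept_id, AVG(salary) AS avg_sal
        FROM employees
        GROUP BY dept_id
    )
    SELECT dept_id, avg_sal
    FROM dept_stats
    WHERE avg_sal = (SELECT MAX(avg_sal) FROM dept_stats)
""").fetchall()

print(top_dept)
```

Without the CTE, the grouped query would have to be written out twice as two separate subqueries.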
Recursive CTEs
Advanced: hierarchical data (trees). Complex: requires separate study.
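For orientation only, here is a minimal sketch of a recursive CTE walking a reporting hierarchy. It uses Python's sqlite3 module; the `staff` table, its columns, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical hierarchy: manager_id points at the parent row.
    CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO staff VALUES
        (1, 'Root', NULL), (2, 'Ana', 1), (3, 'Ben', 1), (4, 'Cy', 2);
""")

hierarchy = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        -- Anchor member: the row with no manager.
        SELECT id, name, 0 FROM staff WHERE manager_id IS NULL
        UNION ALL
        -- Recursive member: direct reports of rows already in chain.
        SELECT s.id, s.name, c.depth + 1
        FROM staff s
        JOIN chain c ON s.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()

print(hierarchy)
```

The recursion ends when the recursive member produces no new rows, which here happens after the deepest level of the tree.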
Practical Examples
Find High Earners
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
Employees in Specific Regions
SELECT name
FROM employees
WHERE dept_id IN (
SELECT dept_id FROM departments WHERE region IN ('East', 'West')
);
Departments with Employees
SELECT dept_name
FROM departments d
WHERE EXISTS (SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id);
Complex Multi-Step
WITH sales_summary AS (
SELECT product_id, SUM(amount) AS total FROM sales GROUP BY product_id
)
SELECT p.name, s.total
FROM products p
JOIN sales_summary s ON p.id = s.product_id
WHERE s.total > 100000;