Introduction
SELECT: fundamental SQL statement retrieving data from tables. Declarative: specify what data needed, not how. Query optimizer: determines execution strategy. Core operation: all database queries build on SELECT.
Syntax simple: SELECT columns FROM table WHERE conditions ORDER BY columns. Semantically rich: filters, aggregations, joins, subqueries enable complex analysis. Foundation: essential for all database work.
Performance critical: an unoptimized SELECT can dominate database load. Index usage, query plan choice, cost estimation: main determinants of speed. Understanding SELECT execution: key to database performance.
"SELECT is the query foundation: retrieve, transform, aggregate data. Declarative simplicity hides complex optimization. Mastery: understanding execution plans, indexes, and performance tuning." -- SQL essentials
Basic SELECT Statement
Syntax
SELECT column1, column2, ...
FROM table_name
WHERE conditions
ORDER BY columns
LIMIT count OFFSET skip
Execution Order
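Logical order (differs from written order): FROM (which table), WHERE (filter rows), GROUP BY and HAVING if present (form and filter groups), SELECT (choose columns), ORDER BY (sort), LIMIT (restrict result count). Understanding order: predicts query behavior, e.g., why a SELECT alias is unavailable in WHERE in most DBMSs.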
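The logical order can be observed on a small data set. The following is a minimal sketch using Python's built-in sqlite3 module as the engine; the table contents are hypothetical sample data, and only the column names come from the text.

```python
import sqlite3

# Hypothetical sample data; column names follow the examples in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ann", "IT", 70000), ("Bob", "IT", 90000),
     ("Cid", "HR", 95000), ("Dee", "IT", 60000)],
)

# FROM -> WHERE -> SELECT -> ORDER BY -> LIMIT:
# the HR row is filtered out first, the remaining rows are sorted,
# and only then does LIMIT keep the first two.
top_two = conn.execute(
    "SELECT name, salary FROM employees "
    "WHERE department = 'IT' "
    "ORDER BY salary DESC "
    "LIMIT 2"
).fetchall()
print(top_two)
```

Cid has the highest salary overall but never reaches the sort step, because WHERE runs before ORDER BY and LIMIT.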
Example Query
SELECT employee_id, name, salary
FROM employees
WHERE department='IT' AND salary>50000
ORDER BY salary DESC
LIMIT 10;
Result: top 10 highest-paid IT employees
NULL Handling
NULL: unknown/missing value. Comparisons with NULL: yield unknown (represented as NULL), never true or false. Special: IS NULL, IS NOT NULL. Aggregates: ignore NULLs (except COUNT(*), which counts rows). Sorting: NULLs first or last (DBMS-dependent).
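Three-valued logic is easiest to see by evaluating a few expressions directly. This sketch assumes SQLite via Python's sqlite3 module, which maps SQL NULL to Python None and boolean results to 0 or 1; other engines display the values differently but follow the same logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row with four expression results.
eq_null, is_null, and_false, or_true = conn.execute(
    "SELECT NULL = NULL, NULL IS NULL, NULL AND 0, NULL OR 1"
).fetchone()

print(eq_null)    # comparison with NULL is unknown
print(is_null)    # IS NULL always gives a definite answer
print(and_false)  # unknown AND false is false
print(or_true)    # unknown OR true is true
```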
Case Sensitivity
Keywords: case-insensitive (SELECT, select, Select equivalent). Column names: usually case-insensitive (configuration-dependent). String comparisons: case sensitivity depends on collation.
Projection (Column Selection)
Column Selection
Specify columns: SELECT col1, col2, col3. Omit: SELECT * (all columns). Projection: reduces columns (may reduce I/O and memory). Selective: specify needed columns.
Column Aliases
SELECT
employee_id AS emp_id,
name AS employee_name,
salary * 1.1 AS projected_salary
FROM employees;
Expressions and Functions
Select columns, computed values, function results. Arithmetic: +, -, *, /. String functions: UPPER, LOWER, SUBSTRING. Date functions: vary by DBMS (e.g., DATEADD, DATEDIFF in SQL Server). Type casting: CAST (standard), CONVERT (SQL Server, MySQL).
Aggregate Functions
COUNT(), SUM(), AVG(), MIN(), MAX() reduce multiple rows to single value. Covered deeper in aggregation section. Basic: SELECT COUNT(*) FROM table.
CAST and Type Conversion
SELECT
CAST(salary AS DECIMAL(10, 2)),
CAST(hire_date AS VARCHAR(20))
FROM employees;
Filtering with WHERE
Basic Conditions
Equality: column=value. Comparison: >, <, >=, <=, <>. Logical: AND, OR, NOT. BETWEEN: column BETWEEN a AND b. IN: column IN (v1, v2, v3). LIKE: pattern matching.
WHERE Clause Evaluation
Evaluated per row. True: row included. False or NULL (unknown): row excluded. Evaluation order: not guaranteed; optimizer may reorder predicates, so do not rely on AND/OR short-circuiting to guard expressions (e.g., against division by zero).
Example Conditions
WHERE salary > 50000
AND (department='IT' OR department='Finance')
AND hire_date >= '2020-01-01'
AND status IS NOT NULL;
LIKE Pattern Matching
% wildcard: any sequence of characters (including none). _ wildcard: exactly one character. Example: LIKE 'A%' (starts with A). Case sensitivity: DBMS-dependent. Performance: prefix pattern (LIKE 'abc%') can often use an index; leading wildcard (LIKE '%abc') usually cannot.
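A short sketch of both wildcards, assuming SQLite through Python's sqlite3 module and hypothetical names. It also illustrates the DBMS dependence: SQLite's LIKE is case-insensitive for ASCII letters by default, so 'alan' matches 'A%' here, which would not happen under a case-sensitive collation elsewhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?)",
    [("Alice",), ("Adam",), ("Bob",), ("alan",)],
)

# % matches any run of characters after the leading 'A'.
starts_with_a = [r[0] for r in conn.execute(
    "SELECT name FROM employees WHERE name LIKE 'A%' ORDER BY name")]

# Each _ matches exactly one character: three letters with 'o' in the middle.
three_letters = [r[0] for r in conn.execute(
    "SELECT name FROM employees WHERE name LIKE '_o_' ORDER BY name")]

print(starts_with_a, three_letters)
```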
IN vs. OR
IN (v1, v2, v3): cleaner for multiple values. OR: equivalent but verbose. Optimization: most optimizers treat both forms the same; either can use an index. Preference: use IN for multiple equality checks (readability).
NULL Conditions
IS NULL: test for NULL. IS NOT NULL: test for non-NULL. Important: NULL = NULL and NULL <> NULL both evaluate to NULL (unknown), never true, so WHERE col = NULL matches no rows. Comparison operators (=, <>) return NULL with NULL operands.
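The effect on filtering can be checked with a three-row table. This is a sketch with hypothetical data, run on SQLite through Python's sqlite3 module; the row with a NULL status appears in neither the = nor the <> result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", "active"), ("Bob", None), ("Cid", "inactive")],
)

def names(where):
    """Return the names matching a WHERE condition, sorted."""
    sql = "SELECT name FROM employees WHERE " + where + " ORDER BY name"
    return [r[0] for r in conn.execute(sql)]

equal = names("status = 'active'")       # Bob's NULL is not equal
not_equal = names("status <> 'active'")  # ...and not unequal either
eq_null = names("status = NULL")         # always unknown: no rows
is_null = names("status IS NULL")        # the correct NULL test

print(equal, not_equal, eq_null, is_null)
```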
Sorting with ORDER BY
Single Column Sorting
ORDER BY salary ASC; -- Ascending (default)
ORDER BY salary DESC; -- Descending
Multi-Column Sorting
ORDER BY department ASC, salary DESC;
Result: rows sorted by department; within each department sorted by salary (highest first)
Sorting by Expression
ORDER BY YEAR(hire_date) DESC. Order by computed values or functions. Not restricted to selected columns.
NULL Ordering
NULLs first or last (DBMS-dependent). SQL Server: NULLs first in ASC. PostgreSQL: NULLs last in ASC. Explicit: NULLS FIRST / NULLS LAST (PostgreSQL, Oracle).
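Where NULLS FIRST / NULLS LAST is unavailable, the placement can be forced by sorting on an IS NULL expression first. The sketch below assumes SQLite via Python's sqlite3 module, where NULLs sort first in ascending order by default, and uses hypothetical data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 70000), ("Bob", None), ("Cid", 50000)],
)

# SQLite's default: NULL sorts before every other value in ASC.
default_asc = [r[0] for r in conn.execute(
    "SELECT name FROM employees ORDER BY salary ASC")]

# Portable workaround: (salary IS NULL) is 0 for real values and 1 for NULL,
# so sorting on it first pushes NULL rows to the end.
nulls_last = [r[0] for r in conn.execute(
    "SELECT name FROM employees ORDER BY (salary IS NULL), salary ASC")]

print(default_asc, nulls_last)
```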
Performance Impact
Sorting: expensive (O(n log n) typical). Without index on sort column: requires sorting entire result. With index: may use index order (fast). Query optimizer: chooses strategy.
LIMIT and OFFSET
SELECT ... ORDER BY ... LIMIT 10; -- First 10 rows
SELECT ... ORDER BY ... LIMIT 10 OFFSET 20; -- Rows 21-30
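LIMIT and OFFSET are commonly combined for pagination. A minimal sketch, assuming SQLite through Python's sqlite3 module; the items table and the fetch_page helper are hypothetical names introduced for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(1, 36)])

def fetch_page(conn, page, page_size):
    """Return the ids on a 1-based page; ORDER BY makes the pages stable."""
    offset = (page - 1) * page_size
    rows = conn.execute(
        "SELECT id FROM items ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return [r[0] for r in rows]

page3 = fetch_page(conn, 3, 10)  # skips 20 rows, returns rows 21-30
page4 = fetch_page(conn, 4, 10)  # last page is short: only 5 rows remain
print(page3, page4)
```

Without ORDER BY the row order is unspecified, so consecutive pages could overlap or skip rows.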
DISTINCT and Deduplication
Remove Duplicates
SELECT DISTINCT department FROM employees;
Result: unique departments (each once, duplicates removed)
Multi-Column DISTINCT
SELECT DISTINCT department, salary FROM employees;
Result: unique (department, salary) combinations
Implementation
Hash-based: build hash table of seen values, output unique. Sort-based: sort, consecutive duplicates removed. Cost: O(n) hash or O(n log n) sort. Expensive: avoid if possible.
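The two strategies can be sketched in a few lines of plain Python. This is an illustration of the idea only, not how any particular engine implements DISTINCT; the function names and sample rows are hypothetical.

```python
def distinct_hash(rows):
    """Hash-based: remember seen rows in a set, emit each row once."""
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

def distinct_sort(rows):
    """Sort-based: after sorting, duplicates are adjacent and can be skipped."""
    out = []
    for row in sorted(rows):
        if not out or out[-1] != row:
            out.append(row)
    return out

rows = [("IT",), ("HR",), ("IT",), ("Finance",), ("HR",)]
by_hash = distinct_hash(rows)  # keeps first-seen order
by_sort = distinct_sort(rows)  # output comes out sorted as a side effect
print(by_hash, by_sort)
```

Both return the same set of rows; the sort-based variant additionally delivers them in order, which an optimizer can exploit when the query also has a matching ORDER BY.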
DISTINCT with ORDER BY
ORDER BY applied after DISTINCT. Can only sort by selected columns. Common: SELECT DISTINCT department FROM employees ORDER BY department.
Performance Consideration
DISTINCT: must examine every input row before the complete result is known. Cost: at least linear in input size (more when sorting). Avoid: if not necessary.
COUNT(DISTINCT)
SELECT COUNT(DISTINCT department) FROM employees;
Result: number of unique departments (e.g., 5 when 100 employee rows span 5 departments)
Aggregation Functions
Basic Aggregates
COUNT(*): number of rows. SUM(column): sum of values. AVG(column): average. MIN/MAX: minimum/maximum value. Reduce many rows to single value.
Example Aggregation
SELECT
COUNT(*) AS total_employees,
AVG(salary) AS avg_salary,
SUM(salary) AS total_payroll,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary
FROM employees;
NULL Handling in Aggregates
Aggregates ignore NULLs: COUNT(column) excludes NULLs (use COUNT(*) for row count). SUM(salary): sums non-NULL values. Average: SUM / COUNT (non-NULL).
COUNT Variations
COUNT(*): all rows (including NULLs). COUNT(column): non-NULL values. COUNT(DISTINCT column): unique non-NULL values. Difference: important for correct results.
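The differences show up as soon as a column contains NULLs. A sketch with hypothetical data, assuming SQLite via Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("IT", 100), ("IT", None), (None, 200), ("HR", 300)],
)

count_all, count_dept, count_distinct, total, average = conn.execute(
    "SELECT COUNT(*), COUNT(department), COUNT(DISTINCT department), "
    "SUM(salary), AVG(salary) FROM employees"
).fetchone()

# COUNT(*) sees 4 rows; COUNT(department) skips the NULL department;
# COUNT(DISTINCT department) counts IT and HR once each.
# SUM and AVG use only the three non-NULL salaries: 600 / 3.
print(count_all, count_dept, count_distinct, total, average)
```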
FILTER Clause
-- FILTER: standard SQL, supported by PostgreSQL and SQLite; not by MySQL or SQL Server
SELECT
COUNT(*) AS all_employees,
COUNT(*) FILTER (WHERE salary > 50000) AS well_paid
FROM employees;
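On systems without FILTER, the same conditional count can be written with a CASE expression inside the aggregate. The sketch below uses that portable form, run on SQLite through Python's sqlite3 module with hypothetical salaries; it relies on COUNT ignoring the NULL that CASE yields when no branch matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 40000), ("Bob", 60000), ("Cid", 80000)],
)

# CASE without ELSE returns NULL for non-matching rows,
# and COUNT(expression) skips NULLs.
all_employees, well_paid = conn.execute(
    "SELECT COUNT(*), "
    "COUNT(CASE WHEN salary > 50000 THEN 1 END) "
    "FROM employees"
).fetchone()

print(all_employees, well_paid)
```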
Window Functions
Advanced: aggregates per group/window. SUM(...) OVER (...). Running totals, moving averages. Separate section (covered in advanced SQL).
Grouping with GROUP BY
Basic Grouping
SELECT department, COUNT(*) AS num_employees
FROM employees
GROUP BY department;
How GROUP BY Works
Partition rows: by grouped column(s). Within each group: apply aggregates. Result: one row per group (each group reduced by aggregate). Powerful: summarization.
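The partition-then-aggregate idea can be mimicked in plain Python. This is a conceptual sketch with hypothetical rows, not a description of how an engine executes GROUP BY internally.

```python
from collections import defaultdict

# (department, salary) pairs standing in for table rows.
rows = [("IT", 70000), ("HR", 50000), ("IT", 90000)]

# Step 1: partition rows by the grouping column.
groups = defaultdict(list)
for department, salary in rows:
    groups[department].append(salary)

# Step 2: apply aggregates within each group, producing one row per group.
# Equivalent in spirit to:
#   SELECT department, COUNT(*), AVG(salary) FROM employees GROUP BY department
summary = {
    department: (len(salaries), sum(salaries) / len(salaries))
    for department, salaries in groups.items()
}
print(summary)
```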
Multiple Columns
SELECT department, salary_level, COUNT(*) AS count
FROM employees
GROUP BY department, salary_level;
Restrictions with GROUP BY
SELECT list: grouped columns or aggregates only. Cannot select non-grouped columns without aggregate (value would be ambiguous within a group). Standard SQL requires this; MySQL enforces it only when ONLY_FULL_GROUP_BY is enabled (the default since 5.7).
NULL in GROUP BY
GROUP BY department: NULLs grouped together. One group for all NULLs. Handled consistently. Useful: identify missing department assignments.
GROUP BY Expressions
SELECT YEAR(hire_date) AS hire_year, COUNT(*) AS count
FROM employees
GROUP BY YEAR(hire_date);
Filtering Groups with HAVING
GROUP BY vs. WHERE
WHERE: filters rows before grouping. HAVING: filters groups after aggregation. Different purposes: row-level vs. group-level filtering.
Example
SELECT department, AVG(salary) AS avg_sal
FROM employees
WHERE hire_date >= '2020-01-01' -- Filter rows
GROUP BY department
HAVING AVG(salary) > 60000; -- Filter groups
HAVING Clause Logic
After GROUP BY: each group created. Aggregates computed per group. HAVING evaluated: groups where condition is true included; false or NULL excluded. Final: result set with matching groups.
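The two filtering stages can be traced on a handful of rows. A sketch assuming SQLite via Python's sqlite3 module; the data is hypothetical and the query follows the example in this section, with an ORDER BY added so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (department TEXT, salary INTEGER, hire_date TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("IT", 70000, "2021-03-01"), ("IT", 90000, "2019-05-01"),
     ("HR", 50000, "2022-01-15"),
     ("Finance", 65000, "2020-06-01"), ("Finance", 61000, "2023-02-01")],
)

# WHERE drops the 2019 IT hire before grouping, so IT averages 70000, not 80000.
# HAVING then drops HR, whose average (50000) is not above 60000.
result = conn.execute(
    "SELECT department, AVG(salary) AS avg_sal "
    "FROM employees "
    "WHERE hire_date >= '2020-01-01' "
    "GROUP BY department "
    "HAVING AVG(salary) > 60000 "
    "ORDER BY department"
).fetchall()
print(result)
```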
Complex HAVING Conditions
HAVING COUNT(*) > 5 AND AVG(salary) > 50000
Performance Implication
WHERE reduces rows before grouping (less work for aggregation). HAVING: all groups built and aggregated before filtering. Use WHERE for conditions on plain columns. HAVING: for aggregate-based conditions.
Aliases in HAVING
Some systems: allow aliases (SELECT AVG(salary) AS avg_sal... HAVING avg_sal > 50000). Standard SQL: requires full expression. Portability: use expression.
Joins (Basic Overview)
Basic Join Syntax
SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;
Join Types
INNER JOIN: matching rows only. LEFT JOIN: left table all rows + matching right. RIGHT JOIN: right table all rows + matching left. FULL OUTER: all rows from both. CROSS JOIN: Cartesian product.
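The difference between join types is clearest when one side has an unmatched row. The sketch below assumes SQLite via Python's sqlite3 module and hypothetical data; it covers INNER, LEFT, and CROSS only, since RIGHT and FULL OUTER support depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept_id INTEGER)")
conn.execute("CREATE TABLE departments (dept_id INTEGER, dept_name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ann", 1), ("Bob", 2), ("Cid", None)])
conn.executemany("INSERT INTO departments VALUES (?, ?)",
                 [(1, "IT"), (2, "HR"), (3, "Finance")])

# INNER JOIN: Cid (no department) and Finance (no employees) both disappear.
inner = conn.execute(
    "SELECT e.name, d.dept_name FROM employees e "
    "INNER JOIN departments d ON e.dept_id = d.dept_id ORDER BY e.name"
).fetchall()

# LEFT JOIN: every employee is kept; missing department columns become NULL.
left = conn.execute(
    "SELECT e.name, d.dept_name FROM employees e "
    "LEFT JOIN departments d ON e.dept_id = d.dept_id ORDER BY e.name"
).fetchall()

# CROSS JOIN: every pairing, 3 x 3 rows.
cross_count = conn.execute(
    "SELECT COUNT(*) FROM employees CROSS JOIN departments"
).fetchone()[0]

print(inner, left, cross_count)
```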
Join Condition
ON clause: specifies join predicate. Example: ON employees.dept_id = departments.dept_id. Can be complex: ON a.id=b.id AND a.type=b.type.
Table Aliases
Shorten names: FROM employees AS e, or FROM employees e. Required when joining same table twice. Improves readability.
Join Performance
Join strategy: nested loop, hash join, sort-merge. Optimizer: chooses based on table size, indexes. Indexes: critical for join performance. Detailed in query optimization section.
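Two of the strategies named above can be sketched in plain Python to show why they differ in cost. This is a simplified illustration with hypothetical function names and data; real engines add buffering, spilling to disk, and many other refinements.

```python
def nested_loop_join(left, right, left_key, right_key):
    """Compare every left row with every right row: O(len(left) * len(right))."""
    return [(l, r) for l in left for r in right if l[left_key] == r[right_key]]

def hash_join(left, right, left_key, right_key):
    """Build a hash table on one input, then probe it once per row of the other."""
    buckets = {}
    for r in right:                       # build phase
        buckets.setdefault(r[right_key], []).append(r)
    return [(l, r)                        # probe phase
            for l in left
            for r in buckets.get(l[left_key], [])]

# Rows are tuples; keys are given as tuple positions.
employees = [("Ann", 1), ("Bob", 2), ("Cid", 1)]
departments = [(1, "IT"), (2, "HR")]

nl_result = nested_loop_join(employees, departments, 1, 0)
hj_result = hash_join(employees, departments, 1, 0)
print(nl_result == hj_result, hj_result)
```

Both produce the same rows; the hash join touches each input once, which is why it tends to win on large inputs without a useful index.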
Multiple Joins
SELECT e.name, d.dept_name, m.name AS manager
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
LEFT JOIN employees m ON e.manager_id = m.employee_id;
Subqueries
Scalar Subquery
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
IN Subquery
SELECT name
FROM employees
WHERE dept_id IN (SELECT dept_id FROM departments WHERE location='NYC');
EXISTS Subquery
SELECT d.dept_name
FROM departments d
WHERE EXISTS (SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id);
Subquery in FROM (Derived Table)
SELECT department, avg_sal FROM (
SELECT department, AVG(salary) AS avg_sal
FROM employees
GROUP BY department
) AS dept_stats
WHERE avg_sal > 50000;
Correlated Subqueries
Subquery references outer query. Conceptually executed once per outer row: potentially expensive (optimizer may decorrelate). Example: FROM employees e1 WHERE e1.salary > (SELECT AVG(e2.salary) FROM employees e2 WHERE e2.dept_id = e1.dept_id).
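A correlated subquery can often be rewritten as a join against a pre-aggregated derived table. The sketch below runs both forms on hypothetical data using SQLite through Python's sqlite3 module and checks that they agree; the alias names e1, e2, and d are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept_id INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ann", 1, 70000), ("Bob", 1, 90000),
     ("Cid", 2, 50000), ("Dee", 2, 50000)],
)

# Correlated form: the inner query refers to e1, the current outer row.
correlated = conn.execute(
    "SELECT e1.name FROM employees e1 "
    "WHERE e1.salary > (SELECT AVG(e2.salary) FROM employees e2 "
    "                   WHERE e2.dept_id = e1.dept_id) "
    "ORDER BY e1.name"
).fetchall()

# Join form: compute each department's average once, then join to it.
joined = conn.execute(
    "SELECT e1.name FROM employees e1 "
    "JOIN (SELECT dept_id, AVG(salary) AS avg_sal "
    "      FROM employees GROUP BY dept_id) d "
    "  ON d.dept_id = e1.dept_id "
    "WHERE e1.salary > d.avg_sal "
    "ORDER BY e1.name"
).fetchall()

print(correlated, joined)
```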
Performance Consideration
Subqueries: sometimes less efficient than joins. Optimizer: often rewrites them as joins or semi-joins. Correlated: can be very expensive when executed per row. When the plan shows per-row execution: consider rewriting as a join.
Query Execution and Optimization
Query Optimizer
Parser: builds abstract syntax tree from SQL. Optimizer: generates multiple valid execution strategies, estimates cost of each, chooses lowest. Plan: execution blueprint (operators, order).
Execution Plan Steps
Parse: syntax check. Compile: query plan generation. Optimize: choose best execution. Execute: run plan. Fetch: return results. Cost: parsing/optimization amortized for repeated queries when plans are cached or statements prepared.
Index Usage
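Indexed column in WHERE condition: index scan or seek possible (fast when predicate is selective). Non-indexed: full table scan (slow on large tables). Multiple indexes: optimizer chooses by estimated cost. Low selectivity: optimizer may prefer table scan even when an index exists.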
Join Order
Multiple tables: join order matters. Hash join: build hash table on smaller input, probe with larger. Indexed join column: enables index nested loop. Optimizer: picks order by estimated cost (not always intuitive).
EXPLAIN Plans
EXPLAIN SELECT ... FROM ...;
Output: operation sequence, estimated rows, cost, index usage (syntax and output format vary by DBMS)
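As one concrete case, SQLite exposes its plan through EXPLAIN QUERY PLAN. The sketch below, using Python's sqlite3 module, compares the plan for the same query before and after adding an index; the table, the index name idx_emp_dept, and the plan helper are hypothetical, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")

def plan(sql):
    """Return the human-readable detail column of each plan row."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT name FROM employees WHERE department = 'IT'"

before = plan(query)  # no index yet: a full scan of employees
conn.execute("CREATE INDEX idx_emp_dept ON employees (department)")
after = plan(query)   # the equality predicate can now use the index

print(before)
print(after)
```

Comparing the two outputs is the basic tuning loop: inspect the plan, change an index or the query, and inspect again.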
Query Tuning
Examine plan: identify bottlenecks. Add indexes: for WHERE/JOIN columns. Rewrite queries: expose optimizer opportunities. Iterative: measure improvements.